With the growing presence of technology in our society, there is a rapidly increasing demand for hardware that supports heavy computational workloads. One of the most important pieces of computer hardware for computationally intensive tasks is the Graphics Processing Unit (GPU), owing to its ability to handle a wide range of parallel processing tasks; this has made GPUs an invaluable resource for companies pursuing any sort of Artificial Intelligence (AI), supercomputing, cryptocurrency, or computer graphics work. Unfortunately, the materials needed to produce GPUs are somewhat scarce, which leads to a small pool of manufacturers that experience significant competition.
The purpose of this project is to predict the price trends of a fixed semiconductor stock (in this case, NVIDIA’s) based on the performance of its competitors, previous pricing, and the volume of shares traded. A variety of statistical learning models will be used, ranging from standard regression techniques to more non-linear models such as random forests and k-nearest neighbors.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(tidymodels)
library(ggplot2)
library(corrplot)
library(discrim)
library(ggthemes)
library(kableExtra)
library(yardstick)
library(visdat)
library(scales)
library(glmnet)
tidymodels_prefer()
conflicted::conflicts_prefer(yardstick::rsq)
set.seed(3435)
This dataset comprises one year’s worth (252 trading days) of New York Stock Exchange data for twelve of the most popular semiconductor manufacturers: Advanced Micro Devices (AMD), Applied Materials Inc. (AMAT), ASML Holding N.V. (ASML), Broadcom Inc. (AVGO), Intel Corporation (INTC), Monolithic Power Systems Inc. (MPWR), Nvidia Corp. (NVDA), NXP Semiconductors NV (NXPI), On Semiconductor Corp. (ON), Qualcomm Inc. (QCOM), Taiwan Semiconductor Manufacturing Co. Ltd. (TSM), and Texas Instruments Inc. (TXN).
All stock market data was obtained from Yahoo Finance. Each company’s one-year historical stock data was individually pulled from Yahoo’s historical data on April 12th, 2024. For example, AMD’s stock prices were obtained by downloading the CSV file from AMD’s Historical Data page, which results in a dataframe with the following variables and entries:
read.csv("data/AMD.csv") %>%
head() %>%
kable() %>%
kable_styling(full_width = F) %>%
scroll_box(width = "100%", height = "200px")
| Date | Open | High | Low | Close | Adj.Close | Volume |
|---|---|---|---|---|---|---|
| 2023-04-13 | 92.79 | 93.16 | 91.83 | 92.09 | 92.09 | 40572500 |
| 2023-04-14 | 91.82 | 92.97 | 90.50 | 91.75 | 91.75 | 38734800 |
| 2023-04-17 | 90.23 | 90.69 | 88.30 | 89.87 | 89.87 | 47250800 |
| 2023-04-18 | 91.61 | 92.16 | 89.33 | 89.78 | 89.78 | 46246300 |
| 2023-04-19 | 88.51 | 90.54 | 88.22 | 89.94 | 89.94 | 37344500 |
| 2023-04-20 | 88.83 | 91.58 | 88.73 | 90.11 | 90.11 | 47082700 |
However, since one goal of this analysis is to test the effect of competitors’ stock performance on a fixed GPU manufacturer’s stock price, the data from multiple CSV files must be combined. The easiest way to do this was to create a separate, combined CSV file, with header columns renamed both to resolve variable name conflicts and to distinguish the data specific to each stock. This was done simply by adding the stock’s symbol (e.g. AMD, INTC) to the beginning of the original variable name:
# Read the data into a dataframe variable 'SSD'
SSD <- read.csv("data/semiconductor_stock_data_mod.csv")
SSD$Date <- as.Date(SSD$Date, format="%m/%d/%y")
SSD %>%
head() %>%
kable() %>%
kable_styling(full_width = F) %>%
scroll_box(width = "100%", height = "200px")
| Date | NVDA_Open | NVDA_High | NVDA_Low | NVDA_Close | NVDA_Adj_Close | NVDA_Volume | TSM_Open | TSM_High | TSM_Low | TSM_Close | TSM_Adj_Close | TSM_Volume | NXPI_Open | NXPI_High | NXPI_Low | NXPI_Close | NXPI_Adj_Close | NXPI_Volume | QCOM_Open | QCOM_High | QCOM_Low | QCOM_Close | QCOM_Adj_Close | QCOM_Volume | MPWR_Open | MPWR_High | MPWR_Low | MPWR_Close | MPWR_Adj_Close | MPWR_Volume | ON_Open | ON_High | ON_Low | ON_Close | ON_Adj_Close | ON_Volume | AMD_Open | AMD_High | AMD_Low | AMD_Close | AMD_Adj_Close | AMD_Volume | INTC_Open | INTC_High | INTC_Low | INTC_Close | INTC_Adj_Close | INTC_Volume | AVGO_Open | AVGO_High | AVGO_Low | AVGO_Close | AVGO_Adj_Close | AVGO_Volume | ASML_Open | ASML_High | ASML_Low | ASML_Close | ASML_Adj_Close | ASML_Volume | AMAT_Open | AMAT_High | AMAT_Low | AMAT_Close | AMAT_Adj_Close | AMAT_Volume | TXN_Open | TXN_High | TXN_Low | TXN_Close | TXN_Adj_Close | TXN_Volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2021-04-12 | 142.8975 | 153.5250 | 141.3925 | 152.0900 | 151.7933 | 86932400 | 122.21 | 122.46 | 119.24 | 120.90 | 114.2340 | 9868400 | 207.98 | 208.22 | 204.87 | 207.94 | 197.1795 | 1686200 | 138.86 | 139.89 | 136.05 | 137.44 | 128.6173 | 10355500 | 375.41 | 377.50 | 367.59 | 373.92 | 366.0652 | 283100 | 42.42 | 42.70 | 41.66 | 42.40 | 42.40 | 4151300 | 82.06 | 82.18 | 78.03 | 78.58 | 78.58 | 62098800 | 68.20 | 68.49 | 64.71 | 65.41 | 59.86862 | 51266900 | 481.93 | 485.42 | 478.60 | 483.67 | 446.5351 | 2324700 | 631.56 | 631.58 | 620.62 | 630.43 | 610.5892 | 739500 | 137.82 | 138.73 | 134.51 | 135.00 | 131.6226 | 11147500 | 192.64 | 194.72 | 191.34 | 192.43 | 175.6870 | 4504800 |
| 2021-04-13 | 152.3150 | 157.0000 | 151.2575 | 156.7950 | 156.4891 | 67621200 | 122.40 | 122.90 | 120.35 | 121.27 | 114.5836 | 8384800 | 207.32 | 207.79 | 200.72 | 202.00 | 191.5468 | 3033300 | 138.38 | 138.77 | 135.75 | 137.30 | 128.4863 | 9225200 | 374.97 | 379.23 | 371.30 | 376.52 | 368.6105 | 192700 | 42.54 | 42.85 | 41.55 | 42.04 | 42.04 | 3625300 | 79.67 | 80.72 | 78.98 | 80.19 | 80.19 | 37767300 | 65.61 | 65.63 | 64.21 | 65.22 | 59.69471 | 26822000 | 485.00 | 488.22 | 480.29 | 484.96 | 447.7260 | 1528100 | 635.63 | 636.74 | 623.72 | 629.12 | 609.3205 | 710500 | 136.63 | 136.99 | 133.20 | 135.10 | 131.7201 | 8034300 | 192.14 | 193.00 | 189.76 | 191.24 | 174.6006 | 4009300 |
| 2021-04-14 | 156.2500 | 157.2050 | 152.2750 | 152.7700 | 152.4719 | 38550000 | 121.99 | 122.43 | 120.50 | 120.84 | 114.1773 | 9521900 | 201.20 | 203.33 | 198.59 | 199.89 | 189.5460 | 2879000 | 137.08 | 137.84 | 133.91 | 134.75 | 126.1000 | 9967800 | 375.00 | 383.00 | 369.10 | 369.92 | 362.1492 | 251900 | 41.59 | 43.17 | 41.44 | 42.10 | 42.10 | 3889800 | 79.88 | 80.13 | 77.94 | 78.55 | 78.55 | 34263800 | 65.31 | 65.38 | 63.84 | 64.19 | 58.75197 | 25768400 | 482.47 | 489.19 | 475.19 | 477.30 | 440.6541 | 1822000 | 635.67 | 641.09 | 627.04 | 630.99 | 611.1315 | 718000 | 134.67 | 137.14 | 133.24 | 134.14 | 130.7841 | 8134200 | 190.46 | 191.50 | 189.01 | 190.33 | 173.7697 | 3555000 |
| 2021-04-15 | 156.6250 | 162.1425 | 156.3150 | 161.3725 | 161.0576 | 59848000 | 121.70 | 122.00 | 116.56 | 118.35 | 111.8246 | 18709100 | 202.76 | 202.76 | 198.53 | 201.77 | 191.3288 | 1858100 | 136.00 | 137.99 | 135.57 | 137.84 | 128.9916 | 11733200 | 374.64 | 383.64 | 374.32 | 381.57 | 373.5545 | 237000 | 42.44 | 42.90 | 41.94 | 42.63 | 42.63 | 3995000 | 80.32 | 83.95 | 79.97 | 83.01 | 83.01 | 68942800 | 63.97 | 65.22 | 63.68 | 65.02 | 59.51164 | 24927700 | 481.64 | 482.31 | 476.78 | 480.00 | 443.1469 | 1837000 | 633.78 | 642.90 | 627.52 | 642.09 | 621.8823 | 980400 | 136.00 | 136.14 | 132.85 | 134.41 | 131.0473 | 8269400 | 191.93 | 193.53 | 190.82 | 193.17 | 176.3626 | 4471900 |
| 2021-04-16 | 160.5300 | 161.6575 | 158.6525 | 159.1250 | 158.8145 | 33520800 | 119.19 | 120.60 | 117.85 | 118.84 | 112.2876 | 9512100 | 201.19 | 202.08 | 199.01 | 199.38 | 189.0624 | 2187600 | 137.62 | 139.01 | 136.65 | 138.21 | 129.3378 | 6583900 | 382.12 | 386.52 | 375.38 | 378.62 | 370.6664 | 322000 | 42.63 | 42.77 | 42.07 | 42.18 | 42.18 | 4630300 | 83.30 | 83.59 | 81.53 | 82.15 | 82.15 | 47280600 | 65.33 | 65.52 | 64.57 | 64.75 | 59.26453 | 24625500 | 480.48 | 481.78 | 476.78 | 478.79 | 442.0297 | 1626300 | 640.26 | 647.94 | 638.48 | 645.69 | 625.3690 | 605200 | 133.50 | 134.74 | 133.01 | 133.73 | 130.3843 | 7686300 | 193.66 | 194.78 | 191.64 | 191.93 | 175.2305 | 5792900 |
| 2021-04-19 | 155.3650 | 158.0750 | 152.3300 | 153.6175 | 153.3178 | 40442000 | 118.00 | 118.88 | 115.20 | 115.40 | 109.0372 | 12630300 | 199.37 | 199.56 | 191.99 | 194.76 | 184.6815 | 2686000 | 136.90 | 137.05 | 134.09 | 135.25 | 126.5678 | 8728900 | 377.03 | 379.21 | 362.76 | 369.60 | 361.8359 | 238900 | 42.00 | 42.53 | 40.58 | 41.07 | 41.07 | 4554600 | 82.13 | 83.18 | 80.39 | 81.11 | 81.11 | 39115500 | 64.70 | 64.74 | 63.07 | 63.63 | 58.23941 | 23997700 | 476.53 | 476.80 | 460.05 | 462.00 | 426.5288 | 2631900 | 637.64 | 639.26 | 622.45 | 630.11 | 610.2793 | 1138900 | 133.39 | 135.28 | 128.70 | 130.89 | 127.6153 | 12826400 | 190.35 | 191.10 | 186.72 | 187.06 | 170.7843 | 5334900 |
This results in a seemingly large initial data frame with 73 columns: 1 date variable (converted to a Date data type using as.Date()) and 6 variables for each of the 12 chosen semiconductor manufacturers. A single year of data corresponds to 252 rows, one per trading day; note, however, that the dim() output below reports 757 rows, which is consistent with the table above starting on 2021-04-12 rather than in April 2023.
dim(SSD)
## [1] 757 73
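The renaming and merging described earlier could be done along the following lines. This is only a sketch: the helper names `read_prefixed()` and `merge_symbols()`, the assumption that each raw file is named `<SYMBOL>.csv` with the columns shown for AMD above, and the join on `Date` are all assumptions, not necessarily the procedure used to build `semiconductor_stock_data_mod.csv`.

```r
# Hypothetical helper: read one per-ticker CSV and prefix every column
# except Date with the ticker symbol (e.g. Adj.Close -> AMD_Adj_Close).
read_prefixed <- function(symbol, dir = "data") {
  df <- read.csv(file.path(dir, paste0(symbol, ".csv")))
  value_cols <- names(df) != "Date"
  # Replace the dot in Adj.Close with an underscore before prefixing
  cleaned <- gsub(".", "_", names(df)[value_cols], fixed = TRUE)
  names(df)[value_cols] <- paste0(symbol, "_", cleaned)
  df
}

# Hypothetical helper: join any number of per-ticker data frames on Date.
merge_symbols <- function(symbols, dir = "data") {
  Reduce(function(a, b) merge(a, b, by = "Date"),
         lapply(symbols, read_prefixed, dir = dir))
}
```

Under these assumptions, a call such as `merge_symbols(c("NVDA", "AMD"))` would return one row per shared date, with columns like `NVDA_Open` and `AMD_Adj_Close`.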
Fortunately, a quick analysis shows that there is no missing data among any of the CSV files downloaded. This is somewhat expected though, since stock market data is meant to be as publicly available as possible and the original features are fairly common metrics for financial institutions to collect.
vis_miss(SSD)
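A numeric cross-check of the plot produced by `vis_miss()` is to count the `NA` values per column directly. The sketch below uses a small stand-in data frame (`toy`, hypothetical, with one value deliberately removed) so that it runs on its own; applied to `SSD`, a result of all zeros would agree with the claim of no missing data.

```r
# Count the missing values in every column of a data frame.
count_missing <- function(df) {
  colSums(is.na(df))
}

# Hypothetical stand-in for SSD with one deliberately missing entry.
toy <- data.frame(
  NVDA_Close  = c(152.09, 156.795, NA),
  NVDA_Volume = c(86932400, 67621200, 38550000)
)
missing_per_column <- count_missing(toy)
print(missing_per_column)
```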
We examine the data in terms of the predictors that are given to us, and then see whether there are any other metrics by which to analyze our stock prices. First, we explain the relevance of each predictor in the initial data frame, though not all variables will be used in our predictive models because some are highly correlated (for example, the previous day’s closing price is heavily correlated with the current day’s opening price). Next, we examine other possible ways to predict our stock’s behavior by looking at both historical metrics and normalized metrics.
For each of the 12 semiconductor manufacturers chosen (AMAT, AMD, ASML, AVGO, INTC, MPWR, NVDA, NXPI, ON, QCOM, TSM, and TXN), there is a variable called xx_Open (where xx is one of the above stock symbols) which corresponds to that stock’s opening price for the day. As the New York Stock Exchange operates from 9:30AM to 4:00PM, this indicates the stock’s price at 9:30AM that day.
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = AMD_Open, color = 'AMD')) +
geom_line(aes(y = NXPI_Open, color = 'NXPI')) +
geom_line(aes(y = TXN_Open, color = 'TXN')) +
geom_line(aes(y = AMAT_Open, color = 'AMAT')) +
scale_color_manual(values = c(
'AMD' = 'green',
'AMAT' = 'white',
'NXPI' = 'pink',
'TXN' = 'lightblue')) +
ylab('USD') +
ggtitle("Opening Price") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = NVDA_Open, color = 'NVDA')) +
geom_line(aes(y = MPWR_Open, color = 'MPWR')) +
geom_line(aes(y = AVGO_Open, color = 'AVGO')) +
geom_line(aes(y = ASML_Open, color = 'ASML')) +
scale_color_manual(values = c(
'NVDA' = 'darkolivegreen1',
'MPWR' = 'moccasin',
'AVGO' = 'coral',
'ASML' = 'gold')) +
ylab('USD') +
ggtitle("Opening Price") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = ON_Open, color = 'ON')) +
geom_line(aes(y = QCOM_Open, color = 'QCOM')) +
geom_line(aes(y = INTC_Open, color = 'INTC')) +
geom_line(aes(y = TSM_Open, color = 'TSM')) +
scale_color_manual(values = c(
'ON' = 'cyan',
'QCOM' = 'purple',
'INTC' = 'yellow',
'TSM' = 'red')) +
ylab('USD') +
ggtitle("Opening Price") +
theme_dark()
Similar to the variable xx_Open, the predictor xx_Close simply represents the manufacturer’s stock price at the closing time (4:00PM) of the Stock Exchange on that given day. Over the period plotted, the trends of the opening and closing prices are almost indistinguishable:
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = AMD_Close, color = 'AMD')) +
geom_line(aes(y = NXPI_Close, color = 'NXPI')) +
geom_line(aes(y = TXN_Close, color = 'TXN')) +
geom_line(aes(y = AMAT_Close, color = 'AMAT')) +
scale_color_manual(values = c(
'AMD' = 'green',
'AMAT' = 'white',
'NXPI' = 'pink',
'TXN' = 'lightblue')) +
ylab('USD') +
ggtitle("Closing Price") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = NVDA_Close, color = 'NVDA')) +
geom_line(aes(y = MPWR_Close, color = 'MPWR')) +
geom_line(aes(y = AVGO_Close, color = 'AVGO')) +
geom_line(aes(y = ASML_Close, color = 'ASML')) +
scale_color_manual(values = c(
'NVDA' = 'darkolivegreen1',
'MPWR' = 'moccasin',
'AVGO' = 'coral',
'ASML' = 'gold')) +
ylab('USD') +
ggtitle("Closing Price") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = ON_Close, color = 'ON')) +
geom_line(aes(y = QCOM_Close, color = 'QCOM')) +
geom_line(aes(y = INTC_Close, color = 'INTC')) +
geom_line(aes(y = TSM_Close, color = 'TSM')) +
scale_color_manual(values = c(
'ON' = 'cyan',
'QCOM' = 'purple',
'INTC' = 'yellow',
'TSM' = 'red')) +
ylab('USD') +
ggtitle("Closing Price") +
theme_dark()
There are ultimately a few noticeable differences close to the extrema (maxima and minima) of several of the stocks, but this simply reflects the fact that stock prices become volatile after an extended period of growth (i.e. a ‘bubble’) or decay.
The variables of the format xx_Adj_Close represent the ‘Adjusted closing prices’ of the respective stocks; though closely related to the closing price, the adjusted closing price takes into account any corporate actions that stock may have undergone that day. For example, this accounts for stock splits, dividends, and rights offerings. Those with a deeper financial knowledge are sometimes able to leverage the difference between a stock’s closing price and adjusted closing price to establish a metric on a company’s profitability — however, no such techniques will be used in this analysis.
It should also be noted that neither the adjusted closing price nor the regular closing price is necessarily equal to the opening price the next day; this simply reflects the fact that the public’s valuation of a given stock is constantly changing, even outside the stock exchange’s usual hours.
The variables xx_High and xx_Low represent the maximum and minimum values, respectively, that the stock reached on that particular day. Since a continuous plot of a stock’s value is not readily available, taking the difference of these two values (i.e. the stock’s movement over a day) is one possible way of gauging how volatile a certain stock is over a period of time.
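To make the overnight difference concrete, the gap between one day’s close and the next day’s open can be computed directly. The following is a minimal sketch; the function name `overnight_gap()` is hypothetical, and the inputs are the first four NVDA opening and closing prices from the table above.

```r
# Overnight gap: today's open minus the previous day's close.
# The first day has no previous close, so its gap is NA.
overnight_gap <- function(open, close) {
  c(NA, open[-1] - close[-length(close)])
}

nvda_open  <- c(142.8975, 152.3150, 156.2500, 156.6250)
nvda_close <- c(152.0900, 156.7950, 152.7700, 161.3725)
gap <- overnight_gap(nvda_open, nvda_close)
print(gap)
```

For these four days the gaps are 0.225, -0.545, and 3.855 USD, so the opening price differs from the previous close on each of them.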
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = (AMD_High - AMD_Low), color = 'AMD')) +
geom_line(aes(y = (NXPI_High - NXPI_Low), color = 'NXPI')) +
geom_line(aes(y = (TXN_High - TXN_Low), color = 'TXN')) +
geom_line(aes(y = (AMAT_High - AMAT_Low), color = 'AMAT')) +
scale_color_manual(values = c(
'AMD' = 'green',
'AMAT' = 'white',
'NXPI' = 'pink',
'TXN' = 'lightblue')) +
ylab('USD') +
ggtitle("High - Low") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = (NVDA_High - NVDA_Low), color = 'NVDA')) +
geom_line(aes(y = (MPWR_High - MPWR_Low), color = 'MPWR')) +
geom_line(aes(y = (AVGO_High - AVGO_Low), color = 'AVGO')) +
geom_line(aes(y = (ASML_High - ASML_Low), color = 'ASML')) +
scale_color_manual(values = c(
'NVDA' = 'darkolivegreen1',
'MPWR' = 'moccasin',
'AVGO' = 'coral',
'ASML' = 'gold')) +
ylab('USD') +
ggtitle("High - Low") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = (ON_High - ON_Low), color = 'ON')) +
geom_line(aes(y = (QCOM_High - QCOM_Low), color = 'QCOM')) +
geom_line(aes(y = (INTC_High - INTC_Low), color = 'INTC')) +
geom_line(aes(y = (TSM_High - TSM_Low), color = 'TSM')) +
scale_color_manual(values = c(
'ON' = 'cyan',
'QCOM' = 'purple',
'INTC' = 'yellow',
'TSM' = 'red')) +
ylab('USD') +
ggtitle("High - Low") +
theme_dark()
It should be noted, however, that a stock’s movement (i.e. High minus Low) is not always the best way to directly compare two stocks, since their average prices could vary drastically. For example, AVGO attains values over 1000 USD per share while Intel Corporation (INTC) regularly holds its share price just under 50 USD; thus, if both stocks fluctuate over a given day by 1% of their total value, the movement of AVGO will appear significantly more drastic than that of INTC, because AVGO’s shares are worth roughly 20 times as much as INTC’s. Ultimately this will not be of importance in the model-fitting stage, since all numeric variables will be rescaled during recipe creation.
Lastly, the variables ending with Volume indicate the number of shares that are traded (i.e. either bought or sold) on that given day. As the only predictor in our dataset not measured in terms of a currency, volume gives useful insights into a company’s popularity and thus potential future trends for that stock.
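One way to compare movement across stocks with very different price levels is to divide the daily range by the closing price. This is a sketch of that idea, not a step taken in the analysis itself; `relative_range()` is a hypothetical helper, and the inputs are the first-row AVGO and INTC values from the table above.

```r
# Daily range expressed as a fraction of the closing price, so that
# stocks trading at very different price levels can be compared.
relative_range <- function(high, low, close) {
  (high - low) / close
}

avgo <- relative_range(high = 485.42, low = 478.60, close = 483.67)
intc <- relative_range(high = 68.49,  low = 64.71,  close = 65.41)
print(c(AVGO = avgo, INTC = intc))
```

On that day AVGO’s range in dollars (6.82) is larger than INTC’s (3.78), yet relative to price INTC moved more (about 5.8% versus about 1.4%).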
data.frame(name=c("AMAT", "AMD", "ASML", "AVGO", "INTC", "MPWR", "NVDA", "NXPI", "ON", "QCOM", "TSM", "TXN"), vols=c( SSD$AMAT_Volume[1], SSD$AMD_Volume[1], SSD$ASML_Volume[1], SSD$AVGO_Volume[1], SSD$INTC_Volume[1], SSD$MPWR_Volume[1], SSD$NVDA_Volume[1], SSD$NXPI_Volume[1], SSD$ON_Volume[1], SSD$QCOM_Volume[1], SSD$TSM_Volume[1], SSD$TXN_Volume[1] ) ) %>% ggplot( aes(x=name, y=vols)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 80000000), labels = label_comma()) +
ylab('') +
xlab('') +
ggtitle(paste("Volume of Shares Traded on", format(SSD$Date[1], "%m/%d/%Y")))
data.frame(name=c("AMAT", "AMD", "ASML", "AVGO", "INTC", "MPWR", "NVDA", "NXPI", "ON", "QCOM", "TSM", "TXN"), vols=c( SSD$AMAT_Volume[126], SSD$AMD_Volume[126], SSD$ASML_Volume[126], SSD$AVGO_Volume[126], SSD$INTC_Volume[126], SSD$MPWR_Volume[126], SSD$NVDA_Volume[126], SSD$NXPI_Volume[126], SSD$ON_Volume[126], SSD$QCOM_Volume[126], SSD$TSM_Volume[126], SSD$TXN_Volume[126] ) ) %>% ggplot( aes(x=name, y=vols)) +
geom_bar(stat = "identity") +
scale_y_continuous(limits = c(0, 80000000), labels = label_comma()) +
ylab('') +
xlab('') +
ggtitle(paste("Volume of Shares Traded on", format(SSD$Date[126], "%m/%d/%Y")))
data.frame(name=c("AMAT", "AMD", "ASML", "AVGO", "INTC", "MPWR", "NVDA", "NXPI", "ON", "QCOM", "TSM", "TXN"), vols=c( SSD$AMAT_Volume[252], SSD$AMD_Volume[252], SSD$ASML_Volume[252], SSD$AVGO_Volume[252], SSD$INTC_Volume[252], SSD$MPWR_Volume[252], SSD$NVDA_Volume[252], SSD$NXPI_Volume[252], SSD$ON_Volume[252], SSD$QCOM_Volume[252], SSD$TSM_Volume[252], SSD$TXN_Volume[252] ) ) %>% ggplot( aes(x=name, y=vols)) +
geom_bar(stat = "identity") +
scale_y_continuous(labels = label_comma()) +
ylab('') +
xlab('') +
ggtitle(paste("Volume of Shares Traded on", format(SSD$Date[252], "%m/%d/%Y")))
Based on the above plots, one can also see that trading volume among these semiconductor manufacturers is dominated by three corporations: AMD, Intel (INTC), and NVIDIA (NVDA).
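Since each bar chart shows a single day, averaging the volume over all rows gives a steadier ranking. The sketch below assumes the `<SYMBOL>_Volume` naming used in `SSD`; the helper name `mean_volume_by_symbol()` is hypothetical, and the toy data frame is built from the first two rows of the table above.

```r
# Average daily volume per ticker, sorted from most to least traded.
mean_volume_by_symbol <- function(df) {
  vol_cols <- grep("_Volume$", names(df), value = TRUE)
  avg <- colMeans(df[vol_cols])
  names(avg) <- sub("_Volume$", "", vol_cols)
  sort(avg, decreasing = TRUE)
}

# Toy stand-in for SSD (two days, three tickers).
toy <- data.frame(
  NVDA_Volume = c(86932400, 67621200),
  AMD_Volume  = c(62098800, 37767300),
  TXN_Volume  = c(4504800, 4009300),
  NVDA_Close  = c(152.09, 156.795)   # non-volume column, ignored
)
ranking <- mean_volume_by_symbol(toy)
print(ranking)
```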
While the six predictors provided by Yahoo Finance give significant insight into each stock’s historical performance over the year, there may be other, more useful metrics that we can use to assess and predict the future growth of our stocks. The main variables we wish to introduce are ones that keep track of data from previous days; since each predictor in our original data frame only applies to a 24-hour window, there could be important information in the long-term trends of a stock that ultimately affects a share’s price.
As investors use historical stock data to determine whether a certain stock is worth buying, it becomes apparent that a stock’s price is, in one way or another, dependent on its previous values. While short-term dependence of this kind holds for any continuously varying quantity, even long-term data can affect a stock’s current value: for example, if a stock has been in a steady downward trend for quite some time, this will negatively affect the perception of potential investors.
While there are multiple financial metrics which account for previous stock prices, this analysis will only look at two basic measurements: the n-day average and the n-day standard deviation (where n is an integer-valued hyperparameter). Although there are subtle differences between the opening price and the closing price of a stock, the larger the value of n is (in our n-day average), the less the distinction should matter in terms of which variable to average; for consistency, we will simply base our new metrics on the closing price of each stock.
Additionally, there is no clear choice for how much previous data to account for: should the analysis look back at a single week’s worth of data, or a month’s? As this is itself an interesting question for the sake of tuning our models, we will treat the window length as an added hyperparameter and consider four possible values: 1 week, 2 weeks, 1 month, and 2 months (taken as 5, 10, 20, and 40 trading days, respectively).
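Written out, the n-day average can be stated as follows, where \(x_1, \dots, x_T\) denotes a stock’s daily closing prices. This notation is introduced here for illustration only; it is meant to match the behavior of the running_average() function defined next, including the shortened window used for the first \(n - 1\) days.

\[
\bar{x}^{(n)}_i \;=\; \frac{1}{\min(i,\,n)} \sum_{j=\max(1,\; i-n+1)}^{i} x_j , \qquad i = 1, \dots, T .
\]

For \(i \ge n\) this is the mean of the \(n\) most recent closing prices (day \(i\) included); for \(i < n\) it is the mean of all \(i\) prices seen so far.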
running_average <- function(my_vec, num_days) {
  #' Takes the running average of a column vector
  #'
  #' Creates a new column vector whose entries are the average of the previous num_days entries
  #' (the current entry included). When not enough data is available to take the average over
  #' num_days, the closest possible average is taken (for example, if num_days = 10, then the
  #' 2nd entry of the output vector is simply the average of the first two values, the 3rd entry
  #' is the average of the first three values, and so forth).
  #'
  #' @param my_vec the column vector to take the average values of
  #' @param num_days the number of days one wishes to average over
  #'
  #' @return A vector whose entries represent the average of the previous num_days entries in my_vec

  # Error handling
  if (!is.vector(my_vec)) {
    stop("Not Vector: First argument of running_average must be a vector")
  }
  if (!is.numeric(my_vec)) {
    stop("Non-numeric Entries: values of vector in first argument must be numeric.")
  }
  if (!is.numeric(num_days) || num_days != round(num_days)) {
    stop("Not Integer: Second argument of running_average must be an integer larger than or equal to 2")
  }
  if (num_days <= 1) {
    stop("Not Large Enough: Second argument of running_average must be an integer larger than or equal to 2")
  }

  # Running sum of the values currently inside the window
  sum_counter <- 0
  # Return variable, preallocated to its final length
  output_vec <- numeric(length(my_vec))

  for (i in seq_along(my_vec)) {
    # Add the current day to the sum
    sum_counter <- sum_counter + my_vec[i]
    if (i <= num_days) {
      # If there are fewer than num_days of data up to the current date,
      # simply take the average of all the days so far to get the closest
      # thing to a running average
      output_vec[i] <- sum_counter / i
    } else {
      # Subtract the data from num_days prior, which has left the window
      sum_counter <- sum_counter - my_vec[i - num_days]
      output_vec[i] <- sum_counter / num_days
    }
  }
  return(output_vec)
}
SSD$NVDA_avg_cl_1W <- running_average(SSD$NVDA_Close, 5)
SSD$TSM_avg_cl_1W <- running_average(SSD$TSM_Close, 5)
SSD$NXPI_avg_cl_1W <- running_average(SSD$NXPI_Close, 5)
SSD$QCOM_avg_cl_1W <- running_average(SSD$QCOM_Close, 5)
SSD$MPWR_avg_cl_1W <- running_average(SSD$MPWR_Close, 5)
SSD$ON_avg_cl_1W <- running_average(SSD$ON_Close, 5)
SSD$AMD_avg_cl_1W <- running_average(SSD$AMD_Close, 5)
SSD$INTC_avg_cl_1W <- running_average(SSD$INTC_Close, 5)
SSD$AVGO_avg_cl_1W <- running_average(SSD$AVGO_Close, 5)
SSD$ASML_avg_cl_1W <- running_average(SSD$ASML_Close, 5)
SSD$AMAT_avg_cl_1W <- running_average(SSD$AMAT_Close, 5)
SSD$TXN_avg_cl_1W <- running_average(SSD$TXN_Close, 5)
SSD$NVDA_avg_cl_2W <- running_average(SSD$NVDA_Close, 10)
SSD$TSM_avg_cl_2W <- running_average(SSD$TSM_Close, 10)
SSD$NXPI_avg_cl_2W <- running_average(SSD$NXPI_Close, 10)
SSD$QCOM_avg_cl_2W <- running_average(SSD$QCOM_Close, 10)
SSD$MPWR_avg_cl_2W <- running_average(SSD$MPWR_Close, 10)
SSD$ON_avg_cl_2W <- running_average(SSD$ON_Close, 10)
SSD$AMD_avg_cl_2W <- running_average(SSD$AMD_Close, 10)
SSD$INTC_avg_cl_2W <- running_average(SSD$INTC_Close, 10)
SSD$AVGO_avg_cl_2W <- running_average(SSD$AVGO_Close, 10)
SSD$ASML_avg_cl_2W <- running_average(SSD$ASML_Close, 10)
SSD$AMAT_avg_cl_2W <- running_average(SSD$AMAT_Close, 10)
SSD$TXN_avg_cl_2W <- running_average(SSD$TXN_Close, 10)
SSD$NVDA_avg_cl_1M <- running_average(SSD$NVDA_Close, 20)
SSD$TSM_avg_cl_1M <- running_average(SSD$TSM_Close, 20)
SSD$NXPI_avg_cl_1M <- running_average(SSD$NXPI_Close, 20)
SSD$QCOM_avg_cl_1M <- running_average(SSD$QCOM_Close, 20)
SSD$MPWR_avg_cl_1M <- running_average(SSD$MPWR_Close, 20)
SSD$ON_avg_cl_1M <- running_average(SSD$ON_Close, 20)
SSD$AMD_avg_cl_1M <- running_average(SSD$AMD_Close, 20)
SSD$INTC_avg_cl_1M <- running_average(SSD$INTC_Close, 20)
SSD$AVGO_avg_cl_1M <- running_average(SSD$AVGO_Close, 20)
SSD$ASML_avg_cl_1M <- running_average(SSD$ASML_Close, 20)
SSD$AMAT_avg_cl_1M <- running_average(SSD$AMAT_Close, 20)
SSD$TXN_avg_cl_1M <- running_average(SSD$TXN_Close, 20)
SSD$NVDA_avg_cl_2M <- running_average(SSD$NVDA_Close, 40)
SSD$TSM_avg_cl_2M <- running_average(SSD$TSM_Close, 40)
SSD$NXPI_avg_cl_2M <- running_average(SSD$NXPI_Close, 40)
SSD$QCOM_avg_cl_2M <- running_average(SSD$QCOM_Close, 40)
SSD$MPWR_avg_cl_2M <- running_average(SSD$MPWR_Close, 40)
SSD$ON_avg_cl_2M <- running_average(SSD$ON_Close, 40)
SSD$AMD_avg_cl_2M <- running_average(SSD$AMD_Close, 40)
SSD$INTC_avg_cl_2M <- running_average(SSD$INTC_Close, 40)
SSD$AVGO_avg_cl_2M <- running_average(SSD$AVGO_Close, 40)
SSD$ASML_avg_cl_2M <- running_average(SSD$ASML_Close, 40)
SSD$AMAT_avg_cl_2M <- running_average(SSD$AMAT_Close, 40)
SSD$TXN_avg_cl_2M <- running_average(SSD$TXN_Close, 40)
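The 48 assignments above follow a single pattern, so they could equally be generated in a loop. The following is a sketch under a few assumptions: `add_running_columns()` is a hypothetical helper, `roll_mean()` is a compact stand-in that reimplements the same shortened-window average so the block runs on its own, and the suffix-to-window mapping mirrors the 5/10/20/40-day choices used above.

```r
# Stand-in for running_average(): mean of the last n values, or of all
# values so far when fewer than n are available.
roll_mean <- function(x, n) {
  cs <- cumsum(x)
  idx <- seq_along(x)
  lagged <- c(rep(0, n), cs)[idx]   # cumulative sum n steps earlier (0 if none)
  (cs - lagged) / pmin(idx, n)
}

# Hypothetical helper: add <SYMBOL>_<tag>_<window> columns for every
# symbol and window, computed from the <SYMBOL>_Close column.
add_running_columns <- function(df, symbols,
                                windows = c("1W" = 5, "2W" = 10, "1M" = 20, "2M" = 40),
                                fun = roll_mean, tag = "avg_cl") {
  for (sym in symbols) {
    for (w in names(windows)) {
      new_col <- paste(sym, tag, w, sep = "_")
      df[[new_col]] <- fun(df[[paste0(sym, "_Close")]], windows[[w]])
    }
  }
  df
}

# Toy data, not real prices.
toy <- data.frame(NVDA_Close = c(1, 2, 3, 4, 5, 6),
                  AMD_Close  = c(6, 5, 4, 3, 2, 1))
toy <- add_running_columns(toy, c("NVDA", "AMD"))
print(names(toy))
```

Passing a different `fun` and `tag` (for instance a running standard deviation with `tag = "std_dev_cl"`) would produce the corresponding deviation columns with the same naming scheme.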
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = AMD_avg_cl_2W, color = 'AMD')) +
geom_line(aes(y = NXPI_avg_cl_2W, color = 'NXPI')) +
geom_line(aes(y = TXN_avg_cl_2W, color = 'TXN')) +
geom_line(aes(y = AMAT_avg_cl_2W, color = 'AMAT')) +
scale_color_manual(values = c(
'AMD' = 'green',
'AMAT' = 'white',
'NXPI' = 'pink',
'TXN' = 'lightblue')) +
ylab('USD') +
ggtitle("2-Week Average") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = NVDA_avg_cl_2W, color = 'NVDA')) +
geom_line(aes(y = MPWR_avg_cl_2W, color = 'MPWR')) +
geom_line(aes(y = AVGO_avg_cl_2W, color = 'AVGO')) +
geom_line(aes(y = ASML_avg_cl_2W, color = 'ASML')) +
scale_color_manual(values = c(
'NVDA' = 'darkolivegreen1',
'MPWR' = 'moccasin',
'AVGO' = 'coral',
'ASML' = 'gold')) +
ylab('USD') +
ggtitle("2-Week Average") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = ON_avg_cl_2W, color = 'ON')) +
geom_line(aes(y = QCOM_avg_cl_2W, color = 'QCOM')) +
geom_line(aes(y = INTC_avg_cl_2W, color = 'INTC')) +
geom_line(aes(y = TSM_avg_cl_2W, color = 'TSM')) +
scale_color_manual(values = c(
'ON' = 'cyan',
'QCOM' = 'purple',
'INTC' = 'yellow',
'TSM' = 'red')) +
ylab('USD') +
ggtitle("2-Week Average") +
theme_dark()
One characteristic that immediately becomes apparent is that plotting the running averages instead of the closing prices seems to “smooth out” the curves; in other words, the running average is much more stable and is not affected by a share’s volatility as much as the original predictors obtained from the CSV. One way to view this is that, as n grows, the n-day average moves from the raw series toward the overall average; since the overall average is a constant function (and thus linear), the “smoothing out” can be seen as interpolating between the data and a \(C^\infty(\mathbb{R})\) (smooth) function.
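The smoothing effect can also be checked numerically: the day-to-day changes of a running average should vary less than those of the raw series, and less still for longer windows. The sketch below uses a simulated random walk (not stock data) and a trailing mean built on `stats::filter()` instead of the running-average function above; both choices, and the `roughness()` measure, are assumptions made so that the block is self-contained.

```r
# Trailing n-value mean; the first n - 1 entries are NA because the
# window is not yet full.
trailing_mean <- function(x, n) {
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

# Standard deviation of day-to-day changes, used here as a simple
# measure of how jagged a curve is.
roughness <- function(x) {
  sd(diff(x), na.rm = TRUE)
}

set.seed(1)
prices <- 100 + cumsum(rnorm(500))   # simulated random walk, not real prices

rough_raw <- roughness(prices)
rough_2w  <- roughness(trailing_mean(prices, 10))
rough_2m  <- roughness(trailing_mean(prices, 40))
print(c(raw = rough_raw, two_week = rough_2w, two_month = rough_2m))
```

On this simulated series the roughness falls as the window grows, which is the same qualitative pattern seen when comparing the 2-week and 2-month plots.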
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = AMD_avg_cl_2M, color = 'AMD')) +
geom_line(aes(y = NXPI_avg_cl_2M, color = 'NXPI')) +
geom_line(aes(y = TXN_avg_cl_2M, color = 'TXN')) +
geom_line(aes(y = AMAT_avg_cl_2M, color = 'AMAT')) +
scale_color_manual(values = c(
'AMD' = 'green',
'AMAT' = 'white',
'NXPI' = 'pink',
'TXN' = 'lightblue')) +
ylab('USD') +
ggtitle("2-Month Average") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = NVDA_avg_cl_2M, color = 'NVDA')) +
geom_line(aes(y = MPWR_avg_cl_2M, color = 'MPWR')) +
geom_line(aes(y = AVGO_avg_cl_2M, color = 'AVGO')) +
geom_line(aes(y = ASML_avg_cl_2M, color = 'ASML')) +
scale_color_manual(values = c(
'NVDA' = 'darkolivegreen1',
'MPWR' = 'moccasin',
'AVGO' = 'coral',
'ASML' = 'gold')) +
ylab('USD') +
ggtitle("2-Month Average") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = ON_avg_cl_2M, color = 'ON')) +
geom_line(aes(y = QCOM_avg_cl_2M, color = 'QCOM')) +
geom_line(aes(y = INTC_avg_cl_2M, color = 'INTC')) +
geom_line(aes(y = TSM_avg_cl_2M, color = 'TSM')) +
scale_color_manual(values = c(
'ON' = 'cyan',
'QCOM' = 'purple',
'INTC' = 'yellow',
'TSM' = 'red')) +
ylab('USD') +
ggtitle("2-Month Average") +
theme_dark()
With a concrete notion of the n-day average closing price of a stock, it is natural to measure the standard deviation as well, in order to gain insight into the volatility of each stock.
running_deviation <- function(my_vec, num_days) {
  #' Takes the running standard deviation of a column vector
  #'
  #' Creates a new column vector whose entries are the standard deviation of the previous num_days
  #' entries, measured about the running average. When not enough data is available to take the
  #' deviation over num_days, the closest possible deviation is taken (for example, if
  #' num_days = 10, then the 2nd entry of the output vector is the deviation over the first two
  #' values, the 3rd entry is the deviation over the first three values, and so forth). The
  #' first entry is set to 0.
  #'
  #' @param my_vec the column vector to take the standard deviation of
  #' @param num_days the number of days one wishes to take the deviation over
  #'
  #' @return A vector whose entries represent the standard deviation of the previous num_days entries in my_vec

  # Error handling
  if (!is.vector(my_vec)) {
    stop("Not Vector: First argument of running_deviation must be a vector")
  }
  if (!is.numeric(my_vec)) {
    stop("Non-numeric Entries: values of vector in first argument must be numeric.")
  }
  if (!is.numeric(num_days) || num_days != round(num_days)) {
    stop("Not Integer: Second argument of running_deviation must be an integer larger than or equal to 2")
  }
  if (num_days <= 1) {
    stop("Not Large Enough: Second argument of running_deviation must be an integer larger than or equal to 2")
  }

  run_avg <- running_average(my_vec, num_days)
  # Running sum of squared deviations inside the window; each term compares
  # a day's value with the running average on that same day
  sum_counter <- 0
  # Return variable, preallocated to its final length
  output_vec <- numeric(length(my_vec))

  # Leaving the first standard deviation at 0 and beginning the loop
  # at 2 prevents a divide by 0 error without adding an additional if-else
  # branch in the loop
  for (i in seq_along(my_vec)[-1]) {
    # Add the next day to the sum
    sum_counter <- sum_counter + (my_vec[i] - run_avg[i])^2
    if (i <= num_days) {
      # If there are fewer than num_days of data up to the current date,
      # use all the days so far to get the closest thing to a running deviation
      output_vec[i] <- sqrt(sum_counter / (i - 1))
    } else {
      # Subtract the data from num_days prior, which has left the window
      sum_counter <- sum_counter - (my_vec[i - num_days] - run_avg[i - num_days])^2
      output_vec[i] <- sqrt(sum_counter / (num_days - 1))
    }
  }
  return(output_vec)
}
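Note that running_deviation() measures each day’s squared deviation against the running average on that same day, so its output is close to, but not identical to, the sample standard deviation of the last num_days closing prices. For comparison, a sketch of the latter is given below; `rolling_sd()` is a hypothetical name, it is not the function used to build the columns in this analysis, and the inputs are the first six NVDA closing prices from the table above.

```r
# Sample standard deviation of the trailing window ending at each index.
# Uses all available values when fewer than n exist; the first entry is 0
# (a single value has no spread), the same convention as running_deviation().
rolling_sd <- function(x, n) {
  stopifnot(is.numeric(x), n >= 2)
  out <- numeric(length(x))
  for (i in seq_along(x)) {
    window <- x[max(1, i - n + 1):i]
    out[i] <- if (length(window) > 1) sd(window) else 0
  }
  out
}

closes <- c(152.09, 156.795, 152.77, 161.3725, 159.125, 153.6175)
sd_3day <- rolling_sd(closes, 3)
print(sd_3day)
```

Here every term in a window is compared with that window’s own mean, at the cost of recomputing the window at each step instead of updating a running sum.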
SSD$NVDA_std_dev_cl_1W <- running_deviation(SSD$NVDA_Close, 5)
SSD$TSM_std_dev_cl_1W <- running_deviation(SSD$TSM_Close, 5)
SSD$NXPI_std_dev_cl_1W <- running_deviation(SSD$NXPI_Close, 5)
SSD$QCOM_std_dev_cl_1W <- running_deviation(SSD$QCOM_Close, 5)
SSD$MPWR_std_dev_cl_1W <- running_deviation(SSD$MPWR_Close, 5)
SSD$ON_std_dev_cl_1W <- running_deviation(SSD$ON_Close, 5)
SSD$AMD_std_dev_cl_1W <- running_deviation(SSD$AMD_Close, 5)
SSD$INTC_std_dev_cl_1W <- running_deviation(SSD$INTC_Close, 5)
SSD$AVGO_std_dev_cl_1W <- running_deviation(SSD$AVGO_Close, 5)
SSD$ASML_std_dev_cl_1W <- running_deviation(SSD$ASML_Close, 5)
SSD$AMAT_std_dev_cl_1W <- running_deviation(SSD$AMAT_Close, 5)
SSD$TXN_std_dev_cl_1W <- running_deviation(SSD$TXN_Close, 5)
SSD$NVDA_std_dev_cl_2W <- running_deviation(SSD$NVDA_Close, 10)
SSD$TSM_std_dev_cl_2W <- running_deviation(SSD$TSM_Close, 10)
SSD$NXPI_std_dev_cl_2W <- running_deviation(SSD$NXPI_Close, 10)
SSD$QCOM_std_dev_cl_2W <- running_deviation(SSD$QCOM_Close, 10)
SSD$MPWR_std_dev_cl_2W <- running_deviation(SSD$MPWR_Close, 10)
SSD$ON_std_dev_cl_2W <- running_deviation(SSD$ON_Close, 10)
SSD$AMD_std_dev_cl_2W <- running_deviation(SSD$AMD_Close, 10)
SSD$INTC_std_dev_cl_2W <- running_deviation(SSD$INTC_Close, 10)
SSD$AVGO_std_dev_cl_2W <- running_deviation(SSD$AVGO_Close, 10)
SSD$ASML_std_dev_cl_2W <- running_deviation(SSD$ASML_Close, 10)
SSD$AMAT_std_dev_cl_2W <- running_deviation(SSD$AMAT_Close, 10)
SSD$TXN_std_dev_cl_2W <- running_deviation(SSD$TXN_Close, 10)
SSD$NVDA_std_dev_cl_1M <- running_deviation(SSD$NVDA_Close, 20)
SSD$TSM_std_dev_cl_1M <- running_deviation(SSD$TSM_Close, 20)
SSD$NXPI_std_dev_cl_1M <- running_deviation(SSD$NXPI_Close, 20)
SSD$QCOM_std_dev_cl_1M <- running_deviation(SSD$QCOM_Close, 20)
SSD$MPWR_std_dev_cl_1M <- running_deviation(SSD$MPWR_Close, 20)
SSD$ON_std_dev_cl_1M <- running_deviation(SSD$ON_Close, 20)
SSD$AMD_std_dev_cl_1M <- running_deviation(SSD$AMD_Close, 20)
SSD$INTC_std_dev_cl_1M <- running_deviation(SSD$INTC_Close, 20)
SSD$AVGO_std_dev_cl_1M <- running_deviation(SSD$AVGO_Close, 20)
SSD$ASML_std_dev_cl_1M <- running_deviation(SSD$ASML_Close, 20)
SSD$AMAT_std_dev_cl_1M <- running_deviation(SSD$AMAT_Close, 20)
SSD$TXN_std_dev_cl_1M <- running_deviation(SSD$TXN_Close, 20)
SSD$NVDA_std_dev_cl_2M <- running_deviation(SSD$NVDA_Close, 40)
SSD$TSM_std_dev_cl_2M <- running_deviation(SSD$TSM_Close, 40)
SSD$NXPI_std_dev_cl_2M <- running_deviation(SSD$NXPI_Close, 40)
SSD$QCOM_std_dev_cl_2M <- running_deviation(SSD$QCOM_Close, 40)
SSD$MPWR_std_dev_cl_2M <- running_deviation(SSD$MPWR_Close, 40)
SSD$ON_std_dev_cl_2M <- running_deviation(SSD$ON_Close, 40)
SSD$AMD_std_dev_cl_2M <- running_deviation(SSD$AMD_Close, 40)
SSD$INTC_std_dev_cl_2M <- running_deviation(SSD$INTC_Close, 40)
SSD$AVGO_std_dev_cl_2M <- running_deviation(SSD$AVGO_Close, 40)
SSD$ASML_std_dev_cl_2M <- running_deviation(SSD$ASML_Close, 40)
SSD$AMAT_std_dev_cl_2M <- running_deviation(SSD$AMAT_Close, 40)
SSD$TXN_std_dev_cl_2M <- running_deviation(SSD$TXN_Close, 40)
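The assignments above all follow one naming pattern (ticker, statistic, window), so they could also be generated in a loop. The following is a minimal sketch of that idea on a toy data frame; `demo` and `stand_in_fun` are placeholders invented for illustration, and in this report they would correspond to `SSD` and `running_deviation`.

```r
# Toy stand-ins so the sketch runs on its own; in the report these would be
# SSD and running_deviation.
demo <- data.frame(NVDA_Close = c(10, 11, 12, 13),
                   TSM_Close  = c(5, 5, 6, 7))
stand_in_fun <- function(x, n) cumsum(x) / seq_along(x)  # placeholder only

tickers <- c("NVDA", "TSM")
windows <- c("1W" = 5, "2W" = 10, "1M" = 20, "2M" = 40)  # trading days

for (tk in tickers) {
  for (w in names(windows)) {
    # Build the column name, e.g. "NVDA_std_dev_cl_1W"
    new_col <- paste0(tk, "_std_dev_cl_", w)
    demo[[new_col]] <- stand_in_fun(demo[[paste0(tk, "_Close")]], windows[[w]])
  }
}

names(demo)
```

The same loop structure would apply to the return and volume features by swapping the source column suffix and the statistic label.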
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = AMD_std_dev_cl_2W, color = 'AMD')) +
geom_line(aes(y = NXPI_std_dev_cl_2W, color = 'NXPI')) +
geom_line(aes(y = TXN_std_dev_cl_2W, color = 'TXN')) +
geom_line(aes(y = AMAT_std_dev_cl_2W, color = 'AMAT')) +
scale_color_manual(values = c(
'AMD' = 'green',
'AMAT' = 'white',
'NXPI' = 'pink',
'TXN' = 'lightblue')) +
ylab('USD') +
ggtitle("2-Week Standard Deviation") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = NVDA_std_dev_cl_2W, color = 'NVDA')) +
geom_line(aes(y = MPWR_std_dev_cl_2W, color = 'MPWR')) +
geom_line(aes(y = AVGO_std_dev_cl_2W, color = 'AVGO')) +
geom_line(aes(y = ASML_std_dev_cl_2W, color = 'ASML')) +
scale_color_manual(values = c(
'NVDA' = 'darkolivegreen1',
'MPWR' = 'moccasin',
'AVGO' = 'coral',
'ASML' = 'gold')) +
ylab('USD') +
ggtitle("2-Week Standard Deviation") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = ON_std_dev_cl_2W, color = 'ON')) +
geom_line(aes(y = QCOM_std_dev_cl_2W, color = 'QCOM')) +
geom_line(aes(y = INTC_std_dev_cl_2W, color = 'INTC')) +
geom_line(aes(y = TSM_std_dev_cl_2W, color = 'TSM')) +
scale_color_manual(values = c(
'ON' = 'cyan',
'QCOM' = 'purple',
'INTC' = 'yellow',
'TSM' = 'red')) +
ylab('USD') +
ggtitle("2-Week Standard Deviation") +
theme_dark()
One of the main reasons to invest in a stock is the belief that there is a profit to be made from the company’s performance. On a day-to-day basis, this can be measured by the difference between the closing price and the opening price: if the closing price is higher than the opening price, an investor theoretically increased their net worth that day (and vice versa). The code below uses the adjusted closing price for this difference. Although we are already tracking the average closing price, an average of a set of prices says nothing about direction and so can hide a downward trend; the difference between opening and closing price, on the other hand, gives some short-term insight into the daily trend of a stock.
SSD$NVDA_Return <- (SSD$NVDA_Adj_Close - SSD$NVDA_Open)
SSD$TSM_Return <- (SSD$TSM_Adj_Close - SSD$TSM_Open)
SSD$NXPI_Return <- (SSD$NXPI_Adj_Close - SSD$NXPI_Open)
SSD$QCOM_Return <- (SSD$QCOM_Adj_Close - SSD$QCOM_Open)
SSD$MPWR_Return <- (SSD$MPWR_Adj_Close - SSD$MPWR_Open)
SSD$ON_Return <- (SSD$ON_Adj_Close - SSD$ON_Open)
SSD$AMD_Return <- (SSD$AMD_Adj_Close - SSD$AMD_Open)
SSD$INTC_Return <- (SSD$INTC_Adj_Close - SSD$INTC_Open)
SSD$AVGO_Return <- (SSD$AVGO_Adj_Close - SSD$AVGO_Open)
SSD$ASML_Return <- (SSD$ASML_Adj_Close - SSD$ASML_Open)
SSD$AMAT_Return <- (SSD$AMAT_Adj_Close - SSD$AMAT_Open)
SSD$TXN_Return <- (SSD$TXN_Adj_Close - SSD$TXN_Open)
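To illustrate why the daily return adds information beyond the average closing price, here is a small worked example with made-up numbers (none of these values come from the dataset). Two five-day series have the same average close, but one rises every day and the other falls every day.

```r
# Made-up prices for illustration only
open_up    <- c(100, 102, 104, 106, 108)
close_up   <- c(102, 104, 106, 108, 110)  # closes above its open every day

open_down  <- c(112, 110, 108, 106, 104)
close_down <- c(110, 108, 106, 104, 102)  # closes below its open every day

# The average closing price cannot tell the two series apart
mean(close_up)
mean(close_down)

# The average daily return (close minus open) has opposite signs
mean(close_up - open_up)
mean(close_down - open_down)
```

Both averages of the closing price equal 106, while the average daily returns are 2 and -2 respectively.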
SSD$NVDA_avg_ret_1W <- running_average(SSD$NVDA_Return, 5)
SSD$TSM_avg_ret_1W <- running_average(SSD$TSM_Return, 5)
SSD$NXPI_avg_ret_1W <- running_average(SSD$NXPI_Return, 5)
SSD$QCOM_avg_ret_1W <- running_average(SSD$QCOM_Return, 5)
SSD$MPWR_avg_ret_1W <- running_average(SSD$MPWR_Return, 5)
SSD$ON_avg_ret_1W <- running_average(SSD$ON_Return, 5)
SSD$AMD_avg_ret_1W <- running_average(SSD$AMD_Return, 5)
SSD$INTC_avg_ret_1W <- running_average(SSD$INTC_Return, 5)
SSD$AVGO_avg_ret_1W <- running_average(SSD$AVGO_Return, 5)
SSD$ASML_avg_ret_1W <- running_average(SSD$ASML_Return, 5)
SSD$AMAT_avg_ret_1W <- running_average(SSD$AMAT_Return, 5)
SSD$TXN_avg_ret_1W <- running_average(SSD$TXN_Return, 5)
SSD$NVDA_avg_ret_2W <- running_average(SSD$NVDA_Return, 10)
SSD$TSM_avg_ret_2W <- running_average(SSD$TSM_Return, 10)
SSD$NXPI_avg_ret_2W <- running_average(SSD$NXPI_Return, 10)
SSD$QCOM_avg_ret_2W <- running_average(SSD$QCOM_Return, 10)
SSD$MPWR_avg_ret_2W <- running_average(SSD$MPWR_Return, 10)
SSD$ON_avg_ret_2W <- running_average(SSD$ON_Return, 10)
SSD$AMD_avg_ret_2W <- running_average(SSD$AMD_Return, 10)
SSD$INTC_avg_ret_2W <- running_average(SSD$INTC_Return, 10)
SSD$AVGO_avg_ret_2W <- running_average(SSD$AVGO_Return, 10)
SSD$ASML_avg_ret_2W <- running_average(SSD$ASML_Return, 10)
SSD$AMAT_avg_ret_2W <- running_average(SSD$AMAT_Return, 10)
SSD$TXN_avg_ret_2W <- running_average(SSD$TXN_Return, 10)
SSD$NVDA_avg_ret_1M <- running_average(SSD$NVDA_Return, 20)
SSD$TSM_avg_ret_1M <- running_average(SSD$TSM_Return, 20)
SSD$NXPI_avg_ret_1M <- running_average(SSD$NXPI_Return, 20)
SSD$QCOM_avg_ret_1M <- running_average(SSD$QCOM_Return, 20)
SSD$MPWR_avg_ret_1M <- running_average(SSD$MPWR_Return, 20)
SSD$ON_avg_ret_1M <- running_average(SSD$ON_Return, 20)
SSD$AMD_avg_ret_1M <- running_average(SSD$AMD_Return, 20)
SSD$INTC_avg_ret_1M <- running_average(SSD$INTC_Return, 20)
SSD$AVGO_avg_ret_1M <- running_average(SSD$AVGO_Return, 20)
SSD$ASML_avg_ret_1M <- running_average(SSD$ASML_Return, 20)
SSD$AMAT_avg_ret_1M <- running_average(SSD$AMAT_Return, 20)
SSD$TXN_avg_ret_1M <- running_average(SSD$TXN_Return, 20)
SSD$NVDA_avg_ret_2M <- running_average(SSD$NVDA_Return, 40)
SSD$TSM_avg_ret_2M <- running_average(SSD$TSM_Return, 40)
SSD$NXPI_avg_ret_2M <- running_average(SSD$NXPI_Return, 40)
SSD$QCOM_avg_ret_2M <- running_average(SSD$QCOM_Return, 40)
SSD$MPWR_avg_ret_2M <- running_average(SSD$MPWR_Return, 40)
SSD$ON_avg_ret_2M <- running_average(SSD$ON_Return, 40)
SSD$AMD_avg_ret_2M <- running_average(SSD$AMD_Return, 40)
SSD$INTC_avg_ret_2M <- running_average(SSD$INTC_Return, 40)
SSD$AVGO_avg_ret_2M <- running_average(SSD$AVGO_Return, 40)
SSD$ASML_avg_ret_2M <- running_average(SSD$ASML_Return, 40)
SSD$AMAT_avg_ret_2M <- running_average(SSD$AMAT_Return, 40)
SSD$TXN_avg_ret_2M <- running_average(SSD$TXN_Return, 40)
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = AMD_avg_ret_2W, color = 'AMD')) +
geom_line(aes(y = NXPI_avg_ret_2W, color = 'NXPI')) +
geom_line(aes(y = TXN_avg_ret_2W, color = 'TXN')) +
geom_line(aes(y = AMAT_avg_ret_2W, color = 'AMAT')) +
scale_color_manual(values = c(
'AMD' = 'green',
'AMAT' = 'white',
'NXPI' = 'pink',
'TXN' = 'lightblue')) +
ylab('USD') +
ggtitle("2-Week Average Return") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = NVDA_avg_ret_2W, color = 'NVDA')) +
geom_line(aes(y = MPWR_avg_ret_2W, color = 'MPWR')) +
geom_line(aes(y = AVGO_avg_ret_2W, color = 'AVGO')) +
geom_line(aes(y = ASML_avg_ret_2W, color = 'ASML')) +
scale_color_manual(values = c(
'NVDA' = 'darkolivegreen1',
'MPWR' = 'moccasin',
'AVGO' = 'coral',
'ASML' = 'gold')) +
ylab('USD') +
ggtitle("2-Week Average Return") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = ON_avg_ret_2W, color = 'ON')) +
geom_line(aes(y = QCOM_avg_ret_2W, color = 'QCOM')) +
geom_line(aes(y = INTC_avg_ret_2W, color = 'INTC')) +
geom_line(aes(y = TSM_avg_ret_2W, color = 'TSM')) +
scale_color_manual(values = c(
'ON' = 'cyan',
'QCOM' = 'purple',
'INTC' = 'yellow',
'TSM' = 'red')) +
ylab('USD') +
ggtitle("2-Week Average Return") +
theme_dark()
SSD$NVDA_std_dev_ret_1W <- running_deviation(SSD$NVDA_Return, 5)
SSD$TSM_std_dev_ret_1W <- running_deviation(SSD$TSM_Return, 5)
SSD$NXPI_std_dev_ret_1W <- running_deviation(SSD$NXPI_Return, 5)
SSD$QCOM_std_dev_ret_1W <- running_deviation(SSD$QCOM_Return, 5)
SSD$MPWR_std_dev_ret_1W <- running_deviation(SSD$MPWR_Return, 5)
SSD$ON_std_dev_ret_1W <- running_deviation(SSD$ON_Return, 5)
SSD$AMD_std_dev_ret_1W <- running_deviation(SSD$AMD_Return, 5)
SSD$INTC_std_dev_ret_1W <- running_deviation(SSD$INTC_Return, 5)
SSD$AVGO_std_dev_ret_1W <- running_deviation(SSD$AVGO_Return, 5)
SSD$ASML_std_dev_ret_1W <- running_deviation(SSD$ASML_Return, 5)
SSD$AMAT_std_dev_ret_1W <- running_deviation(SSD$AMAT_Return, 5)
SSD$TXN_std_dev_ret_1W <- running_deviation(SSD$TXN_Return, 5)
SSD$NVDA_std_dev_ret_2W <- running_deviation(SSD$NVDA_Return, 10)
SSD$TSM_std_dev_ret_2W <- running_deviation(SSD$TSM_Return, 10)
SSD$NXPI_std_dev_ret_2W <- running_deviation(SSD$NXPI_Return, 10)
SSD$QCOM_std_dev_ret_2W <- running_deviation(SSD$QCOM_Return, 10)
SSD$MPWR_std_dev_ret_2W <- running_deviation(SSD$MPWR_Return, 10)
SSD$ON_std_dev_ret_2W <- running_deviation(SSD$ON_Return, 10)
SSD$AMD_std_dev_ret_2W <- running_deviation(SSD$AMD_Return, 10)
SSD$INTC_std_dev_ret_2W <- running_deviation(SSD$INTC_Return, 10)
SSD$AVGO_std_dev_ret_2W <- running_deviation(SSD$AVGO_Return, 10)
SSD$ASML_std_dev_ret_2W <- running_deviation(SSD$ASML_Return, 10)
SSD$AMAT_std_dev_ret_2W <- running_deviation(SSD$AMAT_Return, 10)
SSD$TXN_std_dev_ret_2W <- running_deviation(SSD$TXN_Return, 10)
SSD$NVDA_std_dev_ret_1M <- running_deviation(SSD$NVDA_Return, 20)
SSD$TSM_std_dev_ret_1M <- running_deviation(SSD$TSM_Return, 20)
SSD$NXPI_std_dev_ret_1M <- running_deviation(SSD$NXPI_Return, 20)
SSD$QCOM_std_dev_ret_1M <- running_deviation(SSD$QCOM_Return, 20)
SSD$MPWR_std_dev_ret_1M <- running_deviation(SSD$MPWR_Return, 20)
SSD$ON_std_dev_ret_1M <- running_deviation(SSD$ON_Return, 20)
SSD$AMD_std_dev_ret_1M <- running_deviation(SSD$AMD_Return, 20)
SSD$INTC_std_dev_ret_1M <- running_deviation(SSD$INTC_Return, 20)
SSD$AVGO_std_dev_ret_1M <- running_deviation(SSD$AVGO_Return, 20)
SSD$ASML_std_dev_ret_1M <- running_deviation(SSD$ASML_Return, 20)
SSD$AMAT_std_dev_ret_1M <- running_deviation(SSD$AMAT_Return, 20)
SSD$TXN_std_dev_ret_1M <- running_deviation(SSD$TXN_Return, 20)
SSD$NVDA_std_dev_ret_2M <- running_deviation(SSD$NVDA_Return, 40)
SSD$TSM_std_dev_ret_2M <- running_deviation(SSD$TSM_Return, 40)
SSD$NXPI_std_dev_ret_2M <- running_deviation(SSD$NXPI_Return, 40)
SSD$QCOM_std_dev_ret_2M <- running_deviation(SSD$QCOM_Return, 40)
SSD$MPWR_std_dev_ret_2M <- running_deviation(SSD$MPWR_Return, 40)
SSD$ON_std_dev_ret_2M <- running_deviation(SSD$ON_Return, 40)
SSD$AMD_std_dev_ret_2M <- running_deviation(SSD$AMD_Return, 40)
SSD$INTC_std_dev_ret_2M <- running_deviation(SSD$INTC_Return, 40)
SSD$AVGO_std_dev_ret_2M <- running_deviation(SSD$AVGO_Return, 40)
SSD$ASML_std_dev_ret_2M <- running_deviation(SSD$ASML_Return, 40)
SSD$AMAT_std_dev_ret_2M <- running_deviation(SSD$AMAT_Return, 40)
SSD$TXN_std_dev_ret_2M <- running_deviation(SSD$TXN_Return, 40)
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = AMD_std_dev_ret_2W, color = 'AMD')) +
geom_line(aes(y = NXPI_std_dev_ret_2W, color = 'NXPI')) +
geom_line(aes(y = TXN_std_dev_ret_2W, color = 'TXN')) +
geom_line(aes(y = AMAT_std_dev_ret_2W, color = 'AMAT')) +
scale_color_manual(values = c(
'AMD' = 'green',
'AMAT' = 'white',
'NXPI' = 'pink',
'TXN' = 'lightblue')) +
ylab('USD') +
ggtitle("2-Week Standard Deviation of Return") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = NVDA_std_dev_ret_2W, color = 'NVDA')) +
geom_line(aes(y = MPWR_std_dev_ret_2W, color = 'MPWR')) +
geom_line(aes(y = AVGO_std_dev_ret_2W, color = 'AVGO')) +
geom_line(aes(y = ASML_std_dev_ret_2W, color = 'ASML')) +
scale_color_manual(values = c(
'NVDA' = 'darkolivegreen1',
'MPWR' = 'moccasin',
'AVGO' = 'coral',
'ASML' = 'gold')) +
ylab('USD') +
ggtitle("2-Week Standard Deviation of Return") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = ON_std_dev_ret_2W, color = 'ON')) +
geom_line(aes(y = QCOM_std_dev_ret_2W, color = 'QCOM')) +
geom_line(aes(y = INTC_std_dev_ret_2W, color = 'INTC')) +
geom_line(aes(y = TSM_std_dev_ret_2W, color = 'TSM')) +
scale_color_manual(values = c(
'ON' = 'cyan',
'QCOM' = 'purple',
'INTC' = 'yellow',
'TSM' = 'red')) +
ylab('USD') +
ggtitle("2-Week Standard Deviation of Return") +
theme_dark()
SSD$NVDA_avg_vol_1W <- running_average(SSD$NVDA_Volume, 5)
SSD$TSM_avg_vol_1W <- running_average(SSD$TSM_Volume, 5)
SSD$NXPI_avg_vol_1W <- running_average(SSD$NXPI_Volume, 5)
SSD$QCOM_avg_vol_1W <- running_average(SSD$QCOM_Volume, 5)
SSD$MPWR_avg_vol_1W <- running_average(SSD$MPWR_Volume, 5)
SSD$ON_avg_vol_1W <- running_average(SSD$ON_Volume, 5)
SSD$AMD_avg_vol_1W <- running_average(SSD$AMD_Volume, 5)
SSD$INTC_avg_vol_1W <- running_average(SSD$INTC_Volume, 5)
SSD$AVGO_avg_vol_1W <- running_average(SSD$AVGO_Volume, 5)
SSD$ASML_avg_vol_1W <- running_average(SSD$ASML_Volume, 5)
SSD$AMAT_avg_vol_1W <- running_average(SSD$AMAT_Volume, 5)
SSD$TXN_avg_vol_1W <- running_average(SSD$TXN_Volume, 5)
SSD$NVDA_avg_vol_2W <- running_average(SSD$NVDA_Volume, 10)
SSD$TSM_avg_vol_2W <- running_average(SSD$TSM_Volume, 10)
SSD$NXPI_avg_vol_2W <- running_average(SSD$NXPI_Volume, 10)
SSD$QCOM_avg_vol_2W <- running_average(SSD$QCOM_Volume, 10)
SSD$MPWR_avg_vol_2W <- running_average(SSD$MPWR_Volume, 10)
SSD$ON_avg_vol_2W <- running_average(SSD$ON_Volume, 10)
SSD$AMD_avg_vol_2W <- running_average(SSD$AMD_Volume, 10)
SSD$INTC_avg_vol_2W <- running_average(SSD$INTC_Volume, 10)
SSD$AVGO_avg_vol_2W <- running_average(SSD$AVGO_Volume, 10)
SSD$ASML_avg_vol_2W <- running_average(SSD$ASML_Volume, 10)
SSD$AMAT_avg_vol_2W <- running_average(SSD$AMAT_Volume, 10)
SSD$TXN_avg_vol_2W <- running_average(SSD$TXN_Volume, 10)
SSD$NVDA_avg_vol_1M <- running_average(SSD$NVDA_Volume, 20)
SSD$TSM_avg_vol_1M <- running_average(SSD$TSM_Volume, 20)
SSD$NXPI_avg_vol_1M <- running_average(SSD$NXPI_Volume, 20)
SSD$QCOM_avg_vol_1M <- running_average(SSD$QCOM_Volume, 20)
SSD$MPWR_avg_vol_1M <- running_average(SSD$MPWR_Volume, 20)
SSD$ON_avg_vol_1M <- running_average(SSD$ON_Volume, 20)
SSD$AMD_avg_vol_1M <- running_average(SSD$AMD_Volume, 20)
SSD$INTC_avg_vol_1M <- running_average(SSD$INTC_Volume, 20)
SSD$AVGO_avg_vol_1M <- running_average(SSD$AVGO_Volume, 20)
SSD$ASML_avg_vol_1M <- running_average(SSD$ASML_Volume, 20)
SSD$AMAT_avg_vol_1M <- running_average(SSD$AMAT_Volume, 20)
SSD$TXN_avg_vol_1M <- running_average(SSD$TXN_Volume, 20)
SSD$NVDA_avg_vol_2M <- running_average(SSD$NVDA_Volume, 40)
SSD$TSM_avg_vol_2M <- running_average(SSD$TSM_Volume, 40)
SSD$NXPI_avg_vol_2M <- running_average(SSD$NXPI_Volume, 40)
SSD$QCOM_avg_vol_2M <- running_average(SSD$QCOM_Volume, 40)
SSD$MPWR_avg_vol_2M <- running_average(SSD$MPWR_Volume, 40)
SSD$ON_avg_vol_2M <- running_average(SSD$ON_Volume, 40)
SSD$AMD_avg_vol_2M <- running_average(SSD$AMD_Volume, 40)
SSD$INTC_avg_vol_2M <- running_average(SSD$INTC_Volume, 40)
SSD$AVGO_avg_vol_2M <- running_average(SSD$AVGO_Volume, 40)
SSD$ASML_avg_vol_2M <- running_average(SSD$ASML_Volume, 40)
SSD$AMAT_avg_vol_2M <- running_average(SSD$AMAT_Volume, 40)
SSD$TXN_avg_vol_2M <- running_average(SSD$TXN_Volume, 40)
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = AMD_avg_vol_2W, color = 'AMD')) +
geom_line(aes(y = NXPI_avg_vol_2W, color = 'NXPI')) +
geom_line(aes(y = TXN_avg_vol_2W, color = 'TXN')) +
geom_line(aes(y = AMAT_avg_vol_2W, color = 'AMAT')) +
scale_color_manual(values = c(
'AMD' = 'green',
'AMAT' = 'white',
'NXPI' = 'pink',
'TXN' = 'lightblue')) +
ylab('Shares') +
scale_y_continuous( labels = label_comma()) +
ggtitle("2-Week Average Volume") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = NVDA_avg_vol_2W, color = 'NVDA')) +
geom_line(aes(y = MPWR_avg_vol_2W, color = 'MPWR')) +
geom_line(aes(y = AVGO_avg_vol_2W, color = 'AVGO')) +
geom_line(aes(y = ASML_avg_vol_2W, color = 'ASML')) +
scale_color_manual(values = c(
'NVDA' = 'darkolivegreen1',
'MPWR' = 'moccasin',
'AVGO' = 'coral',
'ASML' = 'gold')) +
ylab('Shares') +
scale_y_continuous( labels = label_comma()) +
ggtitle("2-Week Average Volume") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = ON_avg_vol_2W, color = 'ON')) +
geom_line(aes(y = QCOM_avg_vol_2W, color = 'QCOM')) +
geom_line(aes(y = INTC_avg_vol_2W, color = 'INTC')) +
geom_line(aes(y = TSM_avg_vol_2W, color = 'TSM')) +
scale_color_manual(values = c(
'ON' = 'cyan',
'QCOM' = 'purple',
'INTC' = 'yellow',
'TSM' = 'red')) +
ylab('Shares') +
scale_y_continuous( labels = label_comma()) +
ggtitle("2-Week Average Volume") +
theme_dark()
SSD$NVDA_std_dev_vol_1W <- running_deviation(SSD$NVDA_Volume, 5)
SSD$TSM_std_dev_vol_1W <- running_deviation(SSD$TSM_Volume, 5)
SSD$NXPI_std_dev_vol_1W <- running_deviation(SSD$NXPI_Volume, 5)
SSD$QCOM_std_dev_vol_1W <- running_deviation(SSD$QCOM_Volume, 5)
SSD$MPWR_std_dev_vol_1W <- running_deviation(SSD$MPWR_Volume, 5)
SSD$ON_std_dev_vol_1W <- running_deviation(SSD$ON_Volume, 5)
SSD$AMD_std_dev_vol_1W <- running_deviation(SSD$AMD_Volume, 5)
SSD$INTC_std_dev_vol_1W <- running_deviation(SSD$INTC_Volume, 5)
SSD$AVGO_std_dev_vol_1W <- running_deviation(SSD$AVGO_Volume, 5)
SSD$ASML_std_dev_vol_1W <- running_deviation(SSD$ASML_Volume, 5)
SSD$AMAT_std_dev_vol_1W <- running_deviation(SSD$AMAT_Volume, 5)
SSD$TXN_std_dev_vol_1W <- running_deviation(SSD$TXN_Volume, 5)
SSD$NVDA_std_dev_vol_2W <- running_deviation(SSD$NVDA_Volume, 10)
SSD$TSM_std_dev_vol_2W <- running_deviation(SSD$TSM_Volume, 10)
SSD$NXPI_std_dev_vol_2W <- running_deviation(SSD$NXPI_Volume, 10)
SSD$QCOM_std_dev_vol_2W <- running_deviation(SSD$QCOM_Volume, 10)
SSD$MPWR_std_dev_vol_2W <- running_deviation(SSD$MPWR_Volume, 10)
SSD$ON_std_dev_vol_2W <- running_deviation(SSD$ON_Volume, 10)
SSD$AMD_std_dev_vol_2W <- running_deviation(SSD$AMD_Volume, 10)
SSD$INTC_std_dev_vol_2W <- running_deviation(SSD$INTC_Volume, 10)
SSD$AVGO_std_dev_vol_2W <- running_deviation(SSD$AVGO_Volume, 10)
SSD$ASML_std_dev_vol_2W <- running_deviation(SSD$ASML_Volume, 10)
SSD$AMAT_std_dev_vol_2W <- running_deviation(SSD$AMAT_Volume, 10)
SSD$TXN_std_dev_vol_2W <- running_deviation(SSD$TXN_Volume, 10)
SSD$NVDA_std_dev_vol_1M <- running_deviation(SSD$NVDA_Volume, 20)
SSD$TSM_std_dev_vol_1M <- running_deviation(SSD$TSM_Volume, 20)
SSD$NXPI_std_dev_vol_1M <- running_deviation(SSD$NXPI_Volume, 20)
SSD$QCOM_std_dev_vol_1M <- running_deviation(SSD$QCOM_Volume, 20)
SSD$MPWR_std_dev_vol_1M <- running_deviation(SSD$MPWR_Volume, 20)
SSD$ON_std_dev_vol_1M <- running_deviation(SSD$ON_Volume, 20)
SSD$AMD_std_dev_vol_1M <- running_deviation(SSD$AMD_Volume, 20)
SSD$INTC_std_dev_vol_1M <- running_deviation(SSD$INTC_Volume, 20)
SSD$AVGO_std_dev_vol_1M <- running_deviation(SSD$AVGO_Volume, 20)
SSD$ASML_std_dev_vol_1M <- running_deviation(SSD$ASML_Volume, 20)
SSD$AMAT_std_dev_vol_1M <- running_deviation(SSD$AMAT_Volume, 20)
SSD$TXN_std_dev_vol_1M <- running_deviation(SSD$TXN_Volume, 20)
SSD$NVDA_std_dev_vol_2M <- running_deviation(SSD$NVDA_Volume, 40)
SSD$TSM_std_dev_vol_2M <- running_deviation(SSD$TSM_Volume, 40)
SSD$NXPI_std_dev_vol_2M <- running_deviation(SSD$NXPI_Volume, 40)
SSD$QCOM_std_dev_vol_2M <- running_deviation(SSD$QCOM_Volume, 40)
SSD$MPWR_std_dev_vol_2M <- running_deviation(SSD$MPWR_Volume, 40)
SSD$ON_std_dev_vol_2M <- running_deviation(SSD$ON_Volume, 40)
SSD$AMD_std_dev_vol_2M <- running_deviation(SSD$AMD_Volume, 40)
SSD$INTC_std_dev_vol_2M <- running_deviation(SSD$INTC_Volume, 40)
SSD$AVGO_std_dev_vol_2M <- running_deviation(SSD$AVGO_Volume, 40)
SSD$ASML_std_dev_vol_2M <- running_deviation(SSD$ASML_Volume, 40)
SSD$AMAT_std_dev_vol_2M <- running_deviation(SSD$AMAT_Volume, 40)
SSD$TXN_std_dev_vol_2M <- running_deviation(SSD$TXN_Volume, 40)
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = AMD_std_dev_vol_2W, color = 'AMD')) +
geom_line(aes(y = NXPI_std_dev_vol_2W, color = 'NXPI')) +
geom_line(aes(y = TXN_std_dev_vol_2W, color = 'TXN')) +
geom_line(aes(y = AMAT_std_dev_vol_2W, color = 'AMAT')) +
scale_color_manual(values = c(
'AMD' = 'green',
'AMAT' = 'white',
'NXPI' = 'pink',
'TXN' = 'lightblue')) +
ylab('Shares') +
scale_y_continuous( labels = label_comma()) +
ggtitle("2-Week Standard Deviation of Volume") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = NVDA_std_dev_vol_2W, color = 'NVDA')) +
geom_line(aes(y = MPWR_std_dev_vol_2W, color = 'MPWR')) +
geom_line(aes(y = AVGO_std_dev_vol_2W, color = 'AVGO')) +
geom_line(aes(y = ASML_std_dev_vol_2W, color = 'ASML')) +
scale_color_manual(values = c(
'NVDA' = 'darkolivegreen1',
'MPWR' = 'moccasin',
'AVGO' = 'coral',
'ASML' = 'gold')) +
ylab('Shares') +
scale_y_continuous( labels = label_comma()) +
ggtitle("2-Week Standard Deviation of Volume") +
theme_dark()
ggplot(data = SSD, aes(x=Date)) +
geom_line(aes(y = ON_std_dev_vol_2W, color = 'ON')) +
geom_line(aes(y = QCOM_std_dev_vol_2W, color = 'QCOM')) +
geom_line(aes(y = INTC_std_dev_vol_2W, color = 'INTC')) +
geom_line(aes(y = TSM_std_dev_vol_2W, color = 'TSM')) +
scale_color_manual(values = c(
'ON' = 'cyan',
'QCOM' = 'purple',
'INTC' = 'yellow',
'TSM' = 'red')) +
ylab('Shares') +
scale_y_continuous( labels = label_comma()) +
ggtitle("2-Week Standard Deviation of Volume") +
theme_dark()
While a large array of predictors is useful for seeing the whole picture of the semiconductor market for the 2023-2024 fiscal year, it also carries a potentially significant amount of redundant information. As mentioned earlier, many of our initial predictors from the CSV files are very closely related to one another: the closing price one day is directly tied to the opening price of the following day, and if a stock’s minimum (Low) value is increasing, that generally means the four other price predictors (everything aside from volume) are increasing as well. In addition, the performance of any two stocks is generally going to be heavily correlated because both follow the underlying market’s climate.
Ultimately, to understand the correlations between all of our predictors, we need to cross-examine several subsets of them to see which predictors are correlated for a single stock and which are useful for measuring competition between stocks. Dividing our correlation plots into two types, we first examine how the predictors are correlated for a fixed stock, and test this underlying trend across a subset of our stocks (ASML, INTC, NVDA, and NXPI):
select(SSD, starts_with("INTC")) %>%
cor() %>%
corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="INTC Correlation Plot")
select(SSD, starts_with("NXPI")) %>%
cor() %>%
corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="NXPI Correlation Plot")
select(SSD, starts_with("NVDA")) %>%
cor() %>%
corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="NVDA Correlation Plot")
select(SSD, starts_with("ASML")) %>%
cor() %>%
corrplot(method = "circle", type = "lower", diag = FALSE, tl.cex=0.6, title="ASML Correlation Plot")
From the analysis above, we can conclude fairly quickly that five of our
six original predictors from the CSV file (everything except volume) are
very closely correlated. Additionally, since the running averages are
defined in terms of the closing price of each stock individually, it is
no surprise that for longer time intervals the running average is
closely correlated with the closing price and thus with the remaining
original predictors. A more surprising result is the mildly positive
relationship between the volume of a stock and its normalized
volatility: one might have expected a stock that behaves more
unpredictably to be traded less, though the correlation plot indicates
otherwise. Lastly, for each of the added predictors that depend on a
time-window hyper-parameter, larger values of the time window (i.e. two
months) give little to no new insight into the behavior of a stock’s
value.
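The visual reading of the correlation plots can be cross-checked numerically. The sketch below lists the column pairs whose absolute correlation exceeds a cutoff. The helper name `high_cor_pairs`, the toy data frame, and the 0.9 cutoff are all assumptions made for illustration; on the real data one would pass something like `select(SSD, starts_with("NVDA"))` instead of `toy`.

```r
# Hypothetical helper: list pairs of columns with |correlation| above a cutoff
high_cor_pairs <- function(df, cutoff = 0.9) {
  cm <- cor(df)
  # Look only below the diagonal so each pair is reported once
  idx <- which(abs(cm) > cutoff & lower.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    cor  = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data: Close tracks Open closely, Volume is unrelated noise
set.seed(1)
toy <- data.frame(Open = 1:20)
toy$Close  <- toy$Open + rnorm(20, sd = 0.1)
toy$Volume <- rnorm(20)

high_cor_pairs(toy)
```

A table like this makes it easier to decide which near-duplicate predictors to drop than reading circle sizes off a plot.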
With the predictors for a fixed stock analyzed, the next subset to cross-examine fixes the predictor type and lets the stock vary. As we saw from the correlation plots above, several of our predictors for a fixed stock are closely related to one another, so there is no reason to examine all 13 predictors across our different manufacturers. Instead, we focus on a subset that has minimal pairwise correlation: Volume, 2-Week Average (Closing) Price, 2-Week Standard Deviation, and 2-Week Normalized Volatility.
select(SSD, ends_with("avg_vol_2W")) %>%
cor() %>%
corrplot(method = "circle", type = "lower", tl.cex=0.75, diag = FALSE, title="Volumes")
select(SSD, ends_with("avg_cl_2W")) %>%
cor() %>%
corrplot(method = "circle", type = "lower", tl.cex=0.75, diag = FALSE, title="2-Week Averages")
select(SSD, ends_with("std_dev_cl_2W")) %>%
cor() %>%
corrplot(method = "circle", type = "lower", tl.cex=0.75, diag = FALSE, title="2-Week Standard Deviations")
select(SSD, ends_with("avg_ret_2W")) %>%
cor() %>%
corrplot(method = "circle", type = "lower", tl.cex=0.75, diag = FALSE, title="2-Week Normalized Volatilities")
There are a few key takeaways from this correlation analysis. First,
there is no significant relationship in the volume of shares traded
between any two companies (besides possibly TXN and NXPI). Second, the
fact that most manufacturers’ closing stock prices are heavily
correlated suggests they are affected much more by overall market
trends than by competitors’ actions; the one exception to this trend is
ON Semiconductor Corporation.
With a better picture of how our stock prices can be measured by the given metrics and how those metrics interact with one another, we can now set up our data and begin training our models. This will be done in several steps, beginning with preparing the data so that our models do not become overfitted to a particular dataset.
One of the primary ways we ensure robustness of our models is by partitioning our data into training and testing sets. This guards against the model becoming overfit to the details and noise of the underlying dataset, because a portion of the data (the testing set) remains unseen during the training phase. Ideally, the outcome variable should have similar statistics and variance across both the training and testing sets; this is accomplished by stratifying the split on the desired outcome variable.
SSD_split_1W <- initial_split(SSD, prop = 0.7,
strata = NVDA_avg_cl_1W)
SSD_train_1W <- training(SSD_split_1W)
SSD_test_1W <- testing(SSD_split_1W)
SSD_split_2W <- initial_split(SSD, prop = 0.7,
strata = NVDA_avg_cl_2W)
SSD_train_2W <- training(SSD_split_2W)
SSD_test_2W <- testing(SSD_split_2W)
SSD_split_1M <- initial_split(SSD, prop = 0.7,
strata = NVDA_avg_cl_1M)
SSD_train_1M <- training(SSD_split_1M)
SSD_test_1M <- testing(SSD_split_1M)
SSD_split_2M <- initial_split(SSD, prop = 0.7,
strata = NVDA_avg_cl_2M)
SSD_train_2M <- training(SSD_split_2M)
SSD_test_2M <- testing(SSD_split_2M)
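To make the idea of stratifying on a numeric outcome concrete, here is a rough base-R sketch of the mechanism: bin the outcome into quartiles, then sample 70% of the rows within each bin. This is an illustration on simulated data, not the implementation used by `initial_split` (which, as far as we understand, also bins a numeric `strata` variable before sampling); the variable names are hypothetical.

```r
set.seed(3435)

# Simulated outcome standing in for a running-average price (252 "trading days")
y <- cumsum(rnorm(252)) + 100

# Bin the outcome into quartiles
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Sample 70% of the row indices within each quartile bin
train_idx <- unlist(lapply(split(seq_along(y), bins), function(ix) {
  sample(ix, size = floor(0.7 * length(ix)))
}))
test_idx <- setdiff(seq_along(y), train_idx)

# With stratification, the two sets should have similar outcome means
c(train = mean(y[train_idx]), test = mean(y[test_idx]))
```

Because every quartile contributes the same share of rows to each set, neither set can end up containing only the high-priced or only the low-priced days.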
SSD_recipe_1W = recipe(
NVDA_avg_cl_1W ~ NVDA_std_dev_cl_1W + NVDA_avg_ret_1W + NVDA_std_dev_ret_1W + NVDA_avg_vol_1W + NVDA_std_dev_vol_1W +
TSM_avg_cl_1W + TSM_std_dev_cl_1W + TSM_avg_ret_1W + TSM_std_dev_ret_1W + TSM_avg_vol_1W + TSM_std_dev_vol_1W +
NXPI_avg_cl_1W + NXPI_std_dev_cl_1W + NXPI_avg_ret_1W + NXPI_std_dev_ret_1W + NXPI_avg_vol_1W + NXPI_std_dev_vol_1W + QCOM_avg_cl_1W + QCOM_std_dev_cl_1W + QCOM_avg_ret_1W + QCOM_std_dev_ret_1W + QCOM_avg_vol_1W + QCOM_std_dev_vol_1W + MPWR_avg_cl_1W + MPWR_std_dev_cl_1W + MPWR_avg_ret_1W + MPWR_std_dev_ret_1W + MPWR_avg_vol_1W + MPWR_std_dev_vol_1W + ON_avg_cl_1W + ON_std_dev_cl_1W + ON_avg_ret_1W + ON_std_dev_ret_1W + ON_avg_vol_1W + ON_std_dev_vol_1W + AMD_avg_cl_1W + AMD_std_dev_cl_1W + AMD_avg_ret_1W + AMD_std_dev_ret_1W + AMD_avg_vol_1W + AMD_std_dev_vol_1W + INTC_avg_cl_1W + INTC_std_dev_cl_1W + INTC_avg_ret_1W + INTC_std_dev_ret_1W + INTC_avg_vol_1W + INTC_std_dev_vol_1W + AVGO_avg_cl_1W + AVGO_std_dev_cl_1W + AVGO_avg_ret_1W + AVGO_std_dev_ret_1W + AVGO_avg_vol_1W + AVGO_std_dev_vol_1W + ASML_avg_cl_1W + ASML_std_dev_cl_1W + ASML_avg_ret_1W + ASML_std_dev_ret_1W + ASML_avg_vol_1W + ASML_std_dev_vol_1W + AMAT_avg_cl_1W + AMAT_std_dev_cl_1W + AMAT_avg_ret_1W + AMAT_std_dev_ret_1W + AMAT_avg_vol_1W + AMAT_std_dev_vol_1W + TXN_avg_cl_1W + TXN_std_dev_cl_1W + TXN_avg_ret_1W + TXN_std_dev_ret_1W + TXN_avg_vol_1W + TXN_std_dev_vol_1W,
data=SSD_train_1W) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
SSD_recipe_2W = recipe(
NVDA_avg_cl_2W ~ NVDA_std_dev_cl_2W + NVDA_avg_ret_2W + NVDA_std_dev_ret_2W + NVDA_avg_vol_2W + NVDA_std_dev_vol_2W +
TSM_avg_cl_2W + TSM_std_dev_cl_2W + TSM_avg_ret_2W + TSM_std_dev_ret_2W + TSM_avg_vol_2W + TSM_std_dev_vol_2W +
NXPI_avg_cl_2W + NXPI_std_dev_cl_2W + NXPI_avg_ret_2W + NXPI_std_dev_ret_2W + NXPI_avg_vol_2W + NXPI_std_dev_vol_2W + QCOM_avg_cl_2W + QCOM_std_dev_cl_2W + QCOM_avg_ret_2W + QCOM_std_dev_ret_2W + QCOM_avg_vol_2W + QCOM_std_dev_vol_2W + MPWR_avg_cl_2W + MPWR_std_dev_cl_2W + MPWR_avg_ret_2W + MPWR_std_dev_ret_2W + MPWR_avg_vol_2W + MPWR_std_dev_vol_2W + ON_avg_cl_2W + ON_std_dev_cl_2W + ON_avg_ret_2W + ON_std_dev_ret_2W + ON_avg_vol_2W + ON_std_dev_vol_2W + AMD_avg_cl_2W + AMD_std_dev_cl_2W + AMD_avg_ret_2W + AMD_std_dev_ret_2W + AMD_avg_vol_2W + AMD_std_dev_vol_2W + INTC_avg_cl_2W + INTC_std_dev_cl_2W + INTC_avg_ret_2W + INTC_std_dev_ret_2W + INTC_avg_vol_2W + INTC_std_dev_vol_2W + AVGO_avg_cl_2W + AVGO_std_dev_cl_2W + AVGO_avg_ret_2W + AVGO_std_dev_ret_2W + AVGO_avg_vol_2W + AVGO_std_dev_vol_2W + ASML_avg_cl_2W + ASML_std_dev_cl_2W + ASML_avg_ret_2W + ASML_std_dev_ret_2W + ASML_avg_vol_2W + ASML_std_dev_vol_2W + AMAT_avg_cl_2W + AMAT_std_dev_cl_2W + AMAT_avg_ret_2W + AMAT_std_dev_ret_2W + AMAT_avg_vol_2W + AMAT_std_dev_vol_2W + TXN_avg_cl_2W + TXN_std_dev_cl_2W + TXN_avg_ret_2W + TXN_std_dev_ret_2W + TXN_avg_vol_2W + TXN_std_dev_vol_2W,
data=SSD_train_2W) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
SSD_recipe_1M = recipe(
NVDA_avg_cl_1M ~ NVDA_std_dev_cl_1M + NVDA_avg_ret_1M + NVDA_std_dev_ret_1M + NVDA_avg_vol_1M + NVDA_std_dev_vol_1M +
TSM_avg_cl_1M + TSM_std_dev_cl_1M + TSM_avg_ret_1M + TSM_std_dev_ret_1M + TSM_avg_vol_1M + TSM_std_dev_vol_1M +
NXPI_avg_cl_1M + NXPI_std_dev_cl_1M + NXPI_avg_ret_1M + NXPI_std_dev_ret_1M + NXPI_avg_vol_1M + NXPI_std_dev_vol_1M + QCOM_avg_cl_1M + QCOM_std_dev_cl_1M + QCOM_avg_ret_1M + QCOM_std_dev_ret_1M + QCOM_avg_vol_1M + QCOM_std_dev_vol_1M + MPWR_avg_cl_1M + MPWR_std_dev_cl_1M + MPWR_avg_ret_1M + MPWR_std_dev_ret_1M + MPWR_avg_vol_1M + MPWR_std_dev_vol_1M + ON_avg_cl_1M + ON_std_dev_cl_1M + ON_avg_ret_1M + ON_std_dev_ret_1M + ON_avg_vol_1M + ON_std_dev_vol_1M + AMD_avg_cl_1M + AMD_std_dev_cl_1M + AMD_avg_ret_1M + AMD_std_dev_ret_1M + AMD_avg_vol_1M + AMD_std_dev_vol_1M + INTC_avg_cl_1M + INTC_std_dev_cl_1M + INTC_avg_ret_1M + INTC_std_dev_ret_1M + INTC_avg_vol_1M + INTC_std_dev_vol_1M + AVGO_avg_cl_1M + AVGO_std_dev_cl_1M + AVGO_avg_ret_1M + AVGO_std_dev_ret_1M + AVGO_avg_vol_1M + AVGO_std_dev_vol_1M + ASML_avg_cl_1M + ASML_std_dev_cl_1M + ASML_avg_ret_1M + ASML_std_dev_ret_1M + ASML_avg_vol_1M + ASML_std_dev_vol_1M + AMAT_avg_cl_1M + AMAT_std_dev_cl_1M + AMAT_avg_ret_1M + AMAT_std_dev_ret_1M + AMAT_avg_vol_1M + AMAT_std_dev_vol_1M + TXN_avg_cl_1M + TXN_std_dev_cl_1M + TXN_avg_ret_1M + TXN_std_dev_ret_1M + TXN_avg_vol_1M + TXN_std_dev_vol_1M,
data=SSD_train_1M) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
SSD_recipe_2M = recipe(
NVDA_avg_cl_2M ~ NVDA_std_dev_cl_2M + NVDA_avg_ret_2M + NVDA_std_dev_ret_2M + NVDA_avg_vol_2M + NVDA_std_dev_vol_2M +
TSM_avg_cl_2M + TSM_std_dev_cl_2M + TSM_avg_ret_2M + TSM_std_dev_ret_2M + TSM_avg_vol_2M + TSM_std_dev_vol_2M +
NXPI_avg_cl_2M + NXPI_std_dev_cl_2M + NXPI_avg_ret_2M + NXPI_std_dev_ret_2M + NXPI_avg_vol_2M + NXPI_std_dev_vol_2M + QCOM_avg_cl_2M + QCOM_std_dev_cl_2M + QCOM_avg_ret_2M + QCOM_std_dev_ret_2M + QCOM_avg_vol_2M + QCOM_std_dev_vol_2M + MPWR_avg_cl_2M + MPWR_std_dev_cl_2M + MPWR_avg_ret_2M + MPWR_std_dev_ret_2M + MPWR_avg_vol_2M + MPWR_std_dev_vol_2M + ON_avg_cl_2M + ON_std_dev_cl_2M + ON_avg_ret_2M + ON_std_dev_ret_2M + ON_avg_vol_2M + ON_std_dev_vol_2M + AMD_avg_cl_2M + AMD_std_dev_cl_2M + AMD_avg_ret_2M + AMD_std_dev_ret_2M + AMD_avg_vol_2M + AMD_std_dev_vol_2M + INTC_avg_cl_2M + INTC_std_dev_cl_2M + INTC_avg_ret_2M + INTC_std_dev_ret_2M + INTC_avg_vol_2M + INTC_std_dev_vol_2M + AVGO_avg_cl_2M + AVGO_std_dev_cl_2M + AVGO_avg_ret_2M + AVGO_std_dev_ret_2M + AVGO_avg_vol_2M + AVGO_std_dev_vol_2M + ASML_avg_cl_2M + ASML_std_dev_cl_2M + ASML_avg_ret_2M + ASML_std_dev_ret_2M + ASML_avg_vol_2M + ASML_std_dev_vol_2M + AMAT_avg_cl_2M + AMAT_std_dev_cl_2M + AMAT_avg_ret_2M + AMAT_std_dev_ret_2M + AMAT_avg_vol_2M + AMAT_std_dev_vol_2M + TXN_avg_cl_2M + TXN_std_dev_cl_2M + TXN_avg_ret_2M + TXN_std_dev_ret_2M + TXN_avg_vol_2M + TXN_std_dev_vol_2M,
data=SSD_train_2M) %>%
step_center(all_predictors()) %>%
step_scale(all_predictors())
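The four recipe formulas differ only in their window suffix, so they could also be generated rather than typed out. The following is a minimal base-R sketch of that idea; the helper name `build_formula` is hypothetical and is not part of the original analysis, and it assumes every ticker has the same six summary columns per window.

```r
# Tickers and summary statistics, in the same order as the formulas above
tickers <- c("NVDA", "TSM", "NXPI", "QCOM", "MPWR", "ON",
             "AMD", "INTC", "AVGO", "ASML", "AMAT", "TXN")
stats <- c("avg_cl", "std_dev_cl", "avg_ret", "std_dev_ret",
           "avg_vol", "std_dev_vol")

# Hypothetical helper: build the model formula for one window ("1W", "2W", "1M", "2M")
build_formula <- function(window) {
  # expand.grid varies `stat` fastest, so columns are grouped by ticker
  combos <- expand.grid(stat = stats, ticker = tickers, stringsAsFactors = FALSE)
  all_cols <- paste(combos$ticker, combos$stat, window, sep = "_")
  response <- paste("NVDA_avg_cl", window, sep = "_")
  # Every column except the response becomes a predictor
  reformulate(setdiff(all_cols, response), response = response)
}

f_1M <- build_formula("1M")
n_predictors <- length(attr(terms(f_1M), "term.labels"))
```

The resulting formula could then be passed to `recipe()` in place of the hand-written one, for example `recipe(build_formula("1M"), data = SSD_train_1M)`.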
SSD_folds_1W <- vfold_cv(SSD_train_1W, v = 10, strata = NVDA_avg_cl_1W)
SSD_folds_2W <- vfold_cv(SSD_train_2W, v = 10, strata = NVDA_avg_cl_2W)
SSD_folds_1M <- vfold_cv(SSD_train_1M, v = 10, strata = NVDA_avg_cl_1M)
SSD_folds_2M <- vfold_cv(SSD_train_2M, v = 10, strata = NVDA_avg_cl_2M)
# Linear Regression
lm_model <- linear_reg() %>%
set_engine("lm")
# Ridge Regression
ridge_model <- linear_reg(mixture = 0,
penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
# Lasso Regression
lasso_model <- linear_reg(mixture = 1,
penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
# Elastic Net
elastic_net_model <- linear_reg(mixture = tune(),
penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
# k-Nearest Neighbors (number of neighbors is tuned)
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("regression")
# Linear Regression Workflows
lm_wflow_1W <- workflow() %>%
add_model(lm_model) %>%
add_recipe(SSD_recipe_1W)
lm_wflow_2W <- workflow() %>%
add_model(lm_model) %>%
add_recipe(SSD_recipe_2W)
lm_wflow_1M <- workflow() %>%
add_model(lm_model) %>%
add_recipe(SSD_recipe_1M)
lm_wflow_2M <- workflow() %>%
add_model(lm_model) %>%
add_recipe(SSD_recipe_2M)
# Ridge Regression Workflows
ridge_wflow_1W <- workflow() %>%
add_model(ridge_model) %>%
add_recipe(SSD_recipe_1W)
ridge_wflow_2W <- workflow() %>%
add_model(ridge_model) %>%
add_recipe(SSD_recipe_2W)
ridge_wflow_1M <- workflow() %>%
add_model(ridge_model) %>%
add_recipe(SSD_recipe_1M)
ridge_wflow_2M <- workflow() %>%
add_model(ridge_model) %>%
add_recipe(SSD_recipe_2M)
# Lasso Regression Workflows
lasso_wflow_1W <- workflow() %>%
add_model(lasso_model) %>%
add_recipe(SSD_recipe_1W)
lasso_wflow_2W <- workflow() %>%
add_model(lasso_model) %>%
add_recipe(SSD_recipe_2W)
lasso_wflow_1M <- workflow() %>%
add_model(lasso_model) %>%
add_recipe(SSD_recipe_1M)
lasso_wflow_2M <- workflow() %>%
add_model(lasso_model) %>%
add_recipe(SSD_recipe_2M)
# Elastic Net Workflows
elastic_net_wflow_1W <- workflow() %>%
add_model(elastic_net_model) %>%
add_recipe(SSD_recipe_1W)
elastic_net_wflow_2W <- workflow() %>%
add_model(elastic_net_model) %>%
add_recipe(SSD_recipe_2W)
elastic_net_wflow_1M <- workflow() %>%
add_model(elastic_net_model) %>%
add_recipe(SSD_recipe_1M)
elastic_net_wflow_2M <- workflow() %>%
add_model(elastic_net_model) %>%
add_recipe(SSD_recipe_2M)
# k-Nearest Neighbors Workflows
knn_wflow_1W <- workflow() %>%
add_model(knn_model) %>%
add_recipe(SSD_recipe_1W)
knn_wflow_2W <- workflow() %>%
add_model(knn_model) %>%
add_recipe(SSD_recipe_2W)
knn_wflow_1M <- workflow() %>%
add_model(knn_model) %>%
add_recipe(SSD_recipe_1M)
knn_wflow_2M <- workflow() %>%
add_model(knn_model) %>%
add_recipe(SSD_recipe_2M)
Set up Grids:
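# Grid for Ridge Regression and Lasso Regression
# Note: penalty() uses a log10 scale by default, so range = c(0, 1)
# covers penalties from 10^0 = 1 to 10^1 = 10 (unlike the identity-scaled
# elastic net grid below, which covers 0 to 1)
no_mixture_grid <- grid_regular(penalty(range = c(0, 1)), levels = 50)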
# Grid for Elastic Net
elastic_net_grid <- grid_regular(penalty(range = c(0, 1),
trans = identity_trans()),
mixture(range = c(0, 1)),
levels = 10)
# Grid for k-Nearest Neighbors
knn_grid <- grid_regular(neighbors(range = c(2,20)), levels = 10)
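The ridge/lasso grid and the elastic net grid both pass `range = c(0, 1)` to `penalty()`, but only the elastic net grid sets `trans = identity_trans()`. Since `penalty()` is log10-scaled by default, the two ranges are read differently. The base-R sketch below illustrates the difference for a 5-level grid; it assumes a regular grid places its levels evenly on the transformed scale, and the variable names are illustrative only.

```r
# Five evenly spaced levels between the range endpoints 0 and 1
levels_on_scale <- seq(0, 1, length.out = 5)

# Default log10 scale: the endpoints are exponents, so penalties run from 1 to 10
log_scale_penalties <- 10^levels_on_scale

# Identity scale: the endpoints are the penalties themselves, from 0 to 1
identity_penalties <- levels_on_scale

print(round(log_scale_penalties, 3))
print(identity_penalties)
```

If penalties below 1 are of interest for ridge and lasso as well, a negative lower bound on the log10 scale (for instance `range = c(-3, 0)`) would be one way to cover them.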
Tune Parameters for 1-Week Recipe
# Find optimal parameters for ridge regression
ridge_tune_1W <- tune_grid(
ridge_wflow_1W,
resamples = SSD_folds_1W,
grid = no_mixture_grid
)
ridge_final_wflow_1W <- select_best(ridge_tune_1W, metric="rmse" ) %>%
finalize_workflow(x=ridge_wflow_1W)
# Find optimal parameters for lasso regression
lasso_tune_1W <- tune_grid(
lasso_wflow_1W,
resamples = SSD_folds_1W,
grid = no_mixture_grid
)
lasso_final_wflow_1W <- select_best(lasso_tune_1W, metric="rmse") %>%
finalize_workflow(x=lasso_wflow_1W)
# Find optimal parameters for Elastic Net
elastic_net_tune_1W <- tune_grid(
elastic_net_wflow_1W,
resamples = SSD_folds_1W,
grid = elastic_net_grid
)
elastic_net_final_wflow_1W <- select_best(elastic_net_tune_1W, metric = "rmse") %>%
finalize_workflow(x=elastic_net_wflow_1W)
# Find optimal parameters for k-Nearest Neighbors
knn_tune_1W <- tune_grid(
knn_wflow_1W,
resamples = SSD_folds_1W,
grid = knn_grid
)
knn_final_wflow_1W <- select_best(knn_tune_1W, metric = "rmse") %>%
finalize_workflow(x=knn_wflow_1W)
Tune Parameters for 2-Week Recipe
# Find optimal parameters for ridge regression
ridge_tune_2W <- tune_grid(
ridge_wflow_2W,
resamples = SSD_folds_2W,
grid = no_mixture_grid
)
ridge_final_wflow_2W <- select_best(ridge_tune_2W, metric="rmse" ) %>%
finalize_workflow(x=ridge_wflow_2W)
# Find optimal parameters for lasso regression
lasso_tune_2W <- tune_grid(
lasso_wflow_2W,
resamples = SSD_folds_2W,
grid = no_mixture_grid
)
lasso_final_wflow_2W <- select_best(lasso_tune_2W, metric="rmse") %>%
finalize_workflow(x=lasso_wflow_2W)
# Find optimal parameters for Elastic Net
elastic_net_tune_2W <- tune_grid(
elastic_net_wflow_2W,
resamples = SSD_folds_2W,
grid = elastic_net_grid
)
elastic_net_final_wflow_2W <- select_best(elastic_net_tune_2W, metric = "rmse") %>%
finalize_workflow(x=elastic_net_wflow_2W)
# Find optimal parameters for k-Nearest Neighbors
knn_tune_2W <- tune_grid(
knn_wflow_2W,
resamples = SSD_folds_2W,
grid = knn_grid
)
knn_final_wflow_2W <- select_best(knn_tune_2W, metric = "rmse") %>%
finalize_workflow(x=knn_wflow_2W)
Tune Parameters for 1-Month Recipe
# Find optimal parameters for ridge regression
ridge_tune_1M <- tune_grid(
ridge_wflow_1M,
resamples = SSD_folds_1M,
grid = no_mixture_grid
)
ridge_final_wflow_1M <- select_best(ridge_tune_1M, metric="rmse" ) %>%
finalize_workflow(x=ridge_wflow_1M)
# Find optimal parameters for lasso regression
lasso_tune_1M <- tune_grid(
lasso_wflow_1M,
resamples = SSD_folds_1M,
grid = no_mixture_grid
)
lasso_final_wflow_1M <- select_best(lasso_tune_1M, metric="rmse") %>%
finalize_workflow(x=lasso_wflow_1M)
# Find optimal parameters for Elastic Net
elastic_net_tune_1M <- tune_grid(
elastic_net_wflow_1M,
resamples = SSD_folds_1M,
grid = elastic_net_grid
)
elastic_net_final_wflow_1M <- select_best(elastic_net_tune_1M, metric = "rmse") %>%
finalize_workflow(x=elastic_net_wflow_1M)
# Find optimal parameters for k-Nearest Neighbors
knn_tune_1M <- tune_grid(
knn_wflow_1M,
resamples = SSD_folds_1M,
grid = knn_grid
)
knn_final_wflow_1M <- select_best(knn_tune_1M, metric = "rmse") %>%
finalize_workflow(x=knn_wflow_1M)
Tune Parameters for 2-Month Recipe
# Find optimal parameters for ridge regression
ridge_tune_2M <- tune_grid(
ridge_wflow_2M,
resamples = SSD_folds_2M,
grid = no_mixture_grid
)
ridge_final_wflow_2M <- select_best(ridge_tune_2M, metric="rmse" ) %>%
finalize_workflow(x=ridge_wflow_2M)
# Find optimal parameters for lasso regression
lasso_tune_2M <- tune_grid(
lasso_wflow_2M,
resamples = SSD_folds_2M,
grid = no_mixture_grid
)
lasso_final_wflow_2M <- select_best(lasso_tune_2M, metric="rmse") %>%
finalize_workflow(x=lasso_wflow_2M)
# Find optimal parameters for Elastic Net
elastic_net_tune_2M <- tune_grid(
elastic_net_wflow_2M,
resamples = SSD_folds_2M,
grid = elastic_net_grid
)
elastic_net_final_wflow_2M <- select_best(elastic_net_tune_2M, metric = "rmse") %>%
finalize_workflow(x=elastic_net_wflow_2M)
# Find optimal parameters for k-Nearest Neighbors
knn_tune_2M <- tune_grid(
knn_wflow_2M,
resamples = SSD_folds_2M,
grid = knn_grid
)
knn_final_wflow_2M <- select_best(knn_tune_2M, metric = "rmse") %>%
finalize_workflow(x=knn_wflow_2M)
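Each tuning step above follows the same three calls: `tune_grid()`, `select_best()`, and `finalize_workflow()`. As a sketch only, these could be wrapped in a small helper; the name `tune_and_finalize` is hypothetical and not part of the original analysis, and the calls are namespaced to the tune package so the definition does not depend on which packages are attached.

```r
# Hypothetical helper: tune a workflow over a grid and return the workflow
# finalized with the parameters that give the best value of `metric`
tune_and_finalize <- function(wflow, folds, grid, metric = "rmse") {
  tuned <- tune::tune_grid(wflow, resamples = folds, grid = grid)
  best <- tune::select_best(tuned, metric = metric)
  tune::finalize_workflow(wflow, best)
}

# Usage with the objects defined above (not run here):
# ridge_final_wflow_1W <- tune_and_finalize(ridge_wflow_1W, SSD_folds_1W, no_mixture_grid)
# knn_final_wflow_2M   <- tune_and_finalize(knn_wflow_2M, SSD_folds_2M, knn_grid)
```

One trade-off of this design is that the intermediate tuning results (such as `ridge_tune_1W`) are discarded, so the helper would need to return them as well if they are to be plotted or inspected later.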
# Linear Regression Fits
lm_fit_1W <- fit(lm_wflow_1W, SSD_train_1W)
lm_fit_2W <- fit(lm_wflow_2W, SSD_train_2W)
lm_fit_1M <- fit(lm_wflow_1M, SSD_train_1M)
lm_fit_2M <- fit(lm_wflow_2M, SSD_train_2M)
# Ridge Regression Fits
ridge_fit_1W <- fit(ridge_final_wflow_1W, SSD_train_1W)
ridge_fit_2W <- fit(ridge_final_wflow_2W, SSD_train_2W)
ridge_fit_1M <- fit(ridge_final_wflow_1M, SSD_train_1M)
ridge_fit_2M <- fit(ridge_final_wflow_2M, SSD_train_2M)
# Lasso Regression Fits
lasso_fit_1W <- fit(lasso_final_wflow_1W, SSD_train_1W)
lasso_fit_2W <- fit(lasso_final_wflow_2W, SSD_train_2W)
lasso_fit_1M <- fit(lasso_final_wflow_1M, SSD_train_1M)
lasso_fit_2M <- fit(lasso_final_wflow_2M, SSD_train_2M)
# Elastic Net Fits
elastic_net_fit_1W <- fit(elastic_net_final_wflow_1W, SSD_train_1W)
elastic_net_fit_2W <- fit(elastic_net_final_wflow_2W, SSD_train_2W)
elastic_net_fit_1M <- fit(elastic_net_final_wflow_1M, SSD_train_1M)
elastic_net_fit_2M <- fit(elastic_net_final_wflow_2M, SSD_train_2M)
# k-Nearest Neighbors Fits
knn_fit_1W <- fit(knn_final_wflow_1W, SSD_train_1W)
knn_fit_2W <- fit(knn_final_wflow_2W, SSD_train_2W)
knn_fit_1M <- fit(knn_final_wflow_1M, SSD_train_1M)
knn_fit_2M <- fit(knn_final_wflow_2M, SSD_train_2M)
# Linear Regression Training
lm_train_res_1W <- predict(lm_fit_1W, new_data = SSD_train_1W %>% select(-NVDA_avg_cl_1W))
lm_train_res_1W <- bind_cols(lm_train_res_1W, SSD_train_1W %>% select(NVDA_avg_cl_1W))
lm_train_res_2W <- predict(lm_fit_2W, new_data = SSD_train_2W %>% select(-NVDA_avg_cl_2W))
lm_train_res_2W <- bind_cols(lm_train_res_2W, SSD_train_2W %>% select(NVDA_avg_cl_2W))
lm_train_res_1M <- predict(lm_fit_1M, new_data = SSD_train_1M %>% select(-NVDA_avg_cl_1M))
lm_train_res_1M <- bind_cols(lm_train_res_1M, SSD_train_1M %>% select(NVDA_avg_cl_1M))
lm_train_res_2M <- predict(lm_fit_2M, new_data = SSD_train_2M %>% select(-NVDA_avg_cl_2M))
lm_train_res_2M <- bind_cols(lm_train_res_2M, SSD_train_2M %>% select(NVDA_avg_cl_2M))
# Ridge Regression Training
ridge_train_res_1W <- predict(ridge_fit_1W, new_data = SSD_train_1W %>% select(-NVDA_avg_cl_1W))
ridge_train_res_1W <- bind_cols(ridge_train_res_1W, SSD_train_1W %>% select(NVDA_avg_cl_1W))
ridge_train_res_2W <- predict(ridge_fit_2W, new_data = SSD_train_2W %>% select(-NVDA_avg_cl_2W))
ridge_train_res_2W <- bind_cols(ridge_train_res_2W, SSD_train_2W %>% select(NVDA_avg_cl_2W))
ridge_train_res_1M <- predict(ridge_fit_1M, new_data = SSD_train_1M %>% select(-NVDA_avg_cl_1M))
ridge_train_res_1M <- bind_cols(ridge_train_res_1M, SSD_train_1M %>% select(NVDA_avg_cl_1M))
ridge_train_res_2M <- predict(ridge_fit_2M, new_data = SSD_train_2M %>% select(-NVDA_avg_cl_2M))
ridge_train_res_2M <- bind_cols(ridge_train_res_2M, SSD_train_2M %>% select(NVDA_avg_cl_2M))
# Lasso Regression Training
lasso_train_res_1W <- predict(lasso_fit_1W, new_data = SSD_train_1W %>% select(-NVDA_avg_cl_1W))
lasso_train_res_1W <- bind_cols(lasso_train_res_1W, SSD_train_1W %>% select(NVDA_avg_cl_1W))
lasso_train_res_2W <- predict(lasso_fit_2W, new_data = SSD_train_2W %>% select(-NVDA_avg_cl_2W))
lasso_train_res_2W <- bind_cols(lasso_train_res_2W, SSD_train_2W %>% select(NVDA_avg_cl_2W))
lasso_train_res_1M <- predict(lasso_fit_1M, new_data = SSD_train_1M %>% select(-NVDA_avg_cl_1M))
lasso_train_res_1M <- bind_cols(lasso_train_res_1M, SSD_train_1M %>% select(NVDA_avg_cl_1M))
lasso_train_res_2M <- predict(lasso_fit_2M, new_data = SSD_train_2M %>% select(-NVDA_avg_cl_2M))
lasso_train_res_2M <- bind_cols(lasso_train_res_2M, SSD_train_2M %>% select(NVDA_avg_cl_2M))
# Elastic Net Training
elastic_net_train_res_1W <- predict(elastic_net_fit_1W, new_data = SSD_train_1W %>% select(-NVDA_avg_cl_1W))
elastic_net_train_res_1W <- bind_cols(elastic_net_train_res_1W, SSD_train_1W %>% select(NVDA_avg_cl_1W))
elastic_net_train_res_2W <- predict(elastic_net_fit_2W, new_data = SSD_train_2W %>% select(-NVDA_avg_cl_2W))
elastic_net_train_res_2W <- bind_cols(elastic_net_train_res_2W, SSD_train_2W %>% select(NVDA_avg_cl_2W))
elastic_net_train_res_1M <- predict(elastic_net_fit_1M, new_data = SSD_train_1M %>% select(-NVDA_avg_cl_1M))
elastic_net_train_res_1M <- bind_cols(elastic_net_train_res_1M, SSD_train_1M %>% select(NVDA_avg_cl_1M))
elastic_net_train_res_2M <- predict(elastic_net_fit_2M, new_data = SSD_train_2M %>% select(-NVDA_avg_cl_2M))
elastic_net_train_res_2M <- bind_cols(elastic_net_train_res_2M, SSD_train_2M %>% select(NVDA_avg_cl_2M))
# k-Nearest Neighbors Training
knn_train_res_1W <- predict(knn_fit_1W, new_data = SSD_train_1W %>% select(-NVDA_avg_cl_1W))
knn_train_res_1W <- bind_cols(knn_train_res_1W, SSD_train_1W %>% select(NVDA_avg_cl_1W))
knn_train_res_2W <- predict(knn_fit_2W, new_data = SSD_train_2W %>% select(-NVDA_avg_cl_2W))
knn_train_res_2W <- bind_cols(knn_train_res_2W, SSD_train_2W %>% select(NVDA_avg_cl_2W))
knn_train_res_1M <- predict(knn_fit_1M, new_data = SSD_train_1M %>% select(-NVDA_avg_cl_1M))
knn_train_res_1M <- bind_cols(knn_train_res_1M, SSD_train_1M %>% select(NVDA_avg_cl_1M))
knn_train_res_2M <- predict(knn_fit_2M, new_data = SSD_train_2M %>% select(-NVDA_avg_cl_2M))
knn_train_res_2M <- bind_cols(knn_train_res_2M, SSD_train_2M %>% select(NVDA_avg_cl_2M))
Root Mean Square Error (RMSE) results on the training sets:
tibble(Model = c("Linear Regression", "Ridge Regression", "Lasso Regression", "Elastic Net", "k-Nearest Neighbors"),
One_Week = c((lm_train_res_1W %>% rmse( NVDA_avg_cl_1W, .pred))$.estimate,
(ridge_train_res_1W %>% rmse( NVDA_avg_cl_1W, .pred))$.estimate,
(lasso_train_res_1W %>% rmse( NVDA_avg_cl_1W, .pred))$.estimate,
(elastic_net_train_res_1W %>% rmse( NVDA_avg_cl_1W, .pred))$.estimate,
(knn_train_res_1W %>% rmse( NVDA_avg_cl_1W, .pred))$.estimate ),
Two_Week = c((lm_train_res_2W %>% rmse( NVDA_avg_cl_2W, .pred))$.estimate,
(ridge_train_res_2W %>% rmse( NVDA_avg_cl_2W, .pred))$.estimate,
(lasso_train_res_2W %>% rmse( NVDA_avg_cl_2W, .pred))$.estimate,
(elastic_net_train_res_2W %>% rmse( NVDA_avg_cl_2W, .pred))$.estimate,
(knn_train_res_2W %>% rmse( NVDA_avg_cl_2W, .pred))$.estimate),
One_Month = c((lm_train_res_1M %>% rmse( NVDA_avg_cl_1M, .pred))$.estimate,
(ridge_train_res_1M %>% rmse( NVDA_avg_cl_1M, .pred))$.estimate,
(lasso_train_res_1M %>% rmse( NVDA_avg_cl_1M, .pred))$.estimate,
(elastic_net_train_res_1M %>% rmse( NVDA_avg_cl_1M, .pred))$.estimate,
(knn_train_res_1M %>% rmse( NVDA_avg_cl_1M, .pred))$.estimate),
Two_Month = c((lm_train_res_2M %>% rmse( NVDA_avg_cl_2M, .pred))$.estimate,
(ridge_train_res_2M %>% rmse( NVDA_avg_cl_2M, .pred))$.estimate,
(lasso_train_res_2M %>% rmse( NVDA_avg_cl_2M, .pred))$.estimate,
(elastic_net_train_res_2M %>% rmse( NVDA_avg_cl_2M, .pred))$.estimate,
(knn_train_res_2M %>% rmse( NVDA_avg_cl_2M, .pred))$.estimate)
) %>%
kable() %>%
kable_styling(full_width = F) %>%
scroll_box(width = "100%", height = "200px")
| Model | One_Week | Two_Week | One_Month | Two_Month |
|---|---|---|---|---|
| Linear Regression | 16.836273 | 11.6061295 | 5.869688 | 1.9282477 |
| Ridge Regression | 22.645078 | 17.1559196 | 11.596960 | 7.2480449 |
| Lasso Regression | 21.314497 | 16.8026942 | 12.539594 | 8.8715326 |
| Elastic Net | 16.914634 | 11.8368996 | 6.321052 | 4.7157312 |
| k-Nearest Neighbors | 1.435098 | 0.7352442 | 0.571276 | 0.4497364 |
R^2 results on the training sets:
tibble(Model = c("Linear Regression", "Ridge Regression", "Lasso Regression", "Elastic Net", "k-Nearest Neighbors"),
One_Week = c((lm_train_res_1W %>% rsq( NVDA_avg_cl_1W, .pred))$.estimate,
(ridge_train_res_1W %>% rsq( NVDA_avg_cl_1W, .pred))$.estimate,
(lasso_train_res_1W %>% rsq( NVDA_avg_cl_1W, .pred))$.estimate,
(elastic_net_train_res_1W %>% rsq( NVDA_avg_cl_1W, .pred))$.estimate,
(knn_train_res_1W %>% rsq( NVDA_avg_cl_1W, .pred))$.estimate ),
Two_Week = c((lm_train_res_2W %>% rsq( NVDA_avg_cl_2W, .pred))$.estimate,
(ridge_train_res_2W %>% rsq( NVDA_avg_cl_2W, .pred))$.estimate,
(lasso_train_res_2W %>% rsq( NVDA_avg_cl_2W, .pred))$.estimate,
(elastic_net_train_res_2W %>% rsq( NVDA_avg_cl_2W, .pred))$.estimate,
(knn_train_res_2W %>% rsq( NVDA_avg_cl_2W, .pred))$.estimate),
One_Month = c((lm_train_res_1M %>% rsq( NVDA_avg_cl_1M, .pred))$.estimate,
(ridge_train_res_1M %>% rsq( NVDA_avg_cl_1M, .pred))$.estimate,
(lasso_train_res_1M %>% rsq( NVDA_avg_cl_1M, .pred))$.estimate,
(elastic_net_train_res_1M %>% rsq( NVDA_avg_cl_1M, .pred))$.estimate,
(knn_train_res_1M %>% rsq( NVDA_avg_cl_1M, .pred))$.estimate),
Two_Month = c((lm_train_res_2M %>% rsq( NVDA_avg_cl_2M, .pred))$.estimate,
(ridge_train_res_2M %>% rsq( NVDA_avg_cl_2M, .pred))$.estimate,
(lasso_train_res_2M %>% rsq( NVDA_avg_cl_2M, .pred))$.estimate,
(elastic_net_train_res_2M %>% rsq( NVDA_avg_cl_2M, .pred))$.estimate,
(knn_train_res_2M %>% rsq( NVDA_avg_cl_2M, .pred))$.estimate)
) %>%
kable() %>%
kable_styling(full_width = F) %>%
scroll_box(width = "100%", height = "200px")
| Model | One_Week | Two_Week | One_Month | Two_Month |
|---|---|---|---|---|
| Linear Regression | 0.9914555 | 0.9959253 | 0.9988330 | 0.9998424 |
| Ridge Regression | 0.9848118 | 0.9913159 | 0.9956269 | 0.9979111 |
| Lasso Regression | 0.9863868 | 0.9915371 | 0.9947582 | 0.9967498 |
| Elastic Net | 0.9913760 | 0.9957622 | 0.9986471 | 0.9990764 |
| k-Nearest Neighbors | 0.9999381 | 0.9999837 | 0.9999890 | 0.9999915 |
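To make the two metrics concrete, the base-R sketch below computes them by hand on a small made-up vector. The function names and the toy numbers are illustrative only and are not taken from the data; the R^2 version follows the squared-correlation definition that `yardstick::rsq()` uses.

```r
# RMSE: square root of the mean squared prediction error
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# R^2 as the squared correlation between observed and predicted values
rsq_manual <- function(truth, estimate) {
  cor(truth, estimate)^2
}

# Toy values, not from the semiconductor data
truth <- c(100, 110, 120, 130)
estimate <- c(102, 108, 123, 129)

toy_rmse <- rmse_manual(truth, estimate)
toy_rsq <- rsq_manual(truth, estimate)
```

Because both tables are computed on the same data the models were fitted to, they are likely to be optimistic. This applies especially to k-Nearest Neighbors, where each training point can count itself among its own neighbors, so its very low training RMSE should not be read as evidence of better out-of-sample performance.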